High Performance Computer Architecture
Text Book: Computer Architecture: A Quantitative Approach by Hennessy and Patterson

Dr. A.K. Sahoo


School of Comp. Engg.
KIIT University
Bhubaneswar
Fundamentals of Computer Design
• Today's personal computer costs less than $500, yet has more performance, more main memory, and more disk storage than a computer bought in 1985 for one million dollars.
• Major contributions: technological improvements and better computer architectures, delivering performance improvement of about 25% per year.
• The emergence of the microprocessor in the late 1970s saw roughly 35% growth per year in performance.
Fundamentals of Computer Design
• First, with the mass-produced microprocessor, the virtual elimination of assembly language programming reduced the need for object-code compatibility.
• Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone Linux, lowered the cost and risk of bringing out a new architecture.
Fundamentals of Computer Design
• The RISC-based machines (1980s) focused the attention of designers on two critical performance techniques: the exploitation of
• instruction-level parallelism (initially through pipelining and later through multiple instruction issue), and
• the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations).
Single Processor Performance
[Figure: growth in single-processor performance since 1978, plotted relative to the VAX-11/780 on a log scale (1 to 10,000, years 1978-2006). Performance grew about 25%/year before the RISC era, about 52%/year from the mid-1980s RISC machines through 2002, and has slowed since then (??%/year), prompting the move to multi-processors.]
Contd…
• Figure 1.1 shows that the combination of architectural and organizational enhancements led to 16 years of sustained growth in performance at an annual rate of over 50%.
• A rate that is unprecedented in the computer industry.
• However, Figure 1.1 also shows that this 16-year renaissance is over. Since 2002, processor performance improvement has dropped to about 22% per year.
• This is due to the triple hurdles of: maximum power dissipation of air-cooled chips, little instruction-level parallelism left to exploit efficiently, and almost unchanged memory latency.
Contd…
• This signals a historic switch from relying solely on instruction-level parallelism (ILP) to
• thread-level parallelism (TLP) and
• data-level parallelism (DLP)
Defining Computer Architecture
• The task the computer designer faces is a complex one: determining the important attributes of a new computer, then designing it to maximize performance while staying within cost, power, and availability constraints.
• This task has many aspects, including
instruction set design, functional organization,
logic design, and implementation.
Contd…
• The implementation may encompass
integrated circuit design, packaging, power,
and cooling.
• Optimizing the design requires familiarity
with a very wide range of technologies, from
compilers and operating systems to logic
design and packaging.
Instruction Set Architecture
• The ISA (80x86, MIPS) serves as the boundary
between the software and hardware.
Seven dimensions of an ISA
1. Class of ISA: Nearly all ISAs today are classified
as general-purpose register architectures, the
operands are either registers or memory
locations.
• The 80x86 has 16 general-purpose registers and 16 that can hold floating-point data.
Contd…
• MIPS has 32 general-purpose and 32 floating-
point registers
• The two popular versions of this class are
register-memory (80x86), which can access
memory as part of many instructions,
and
• load-store (MIPS), which can access memory
only with load or store instructions.
• All recent ISAs are load-store.
Contd…
2. Memory addressing
• Virtually all desktop and server computers, including the 80x86 and MIPS, use byte addressing to access memory operands.
• Some architectures, such as MIPS, require that objects be aligned.
• The 80x86 does not require alignment, but accesses are generally faster if operands are aligned.
• An access to an object of size s bytes at byte address A is aligned if A mod s = 0 (see the sketch below).
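A minimal Python sketch of this alignment test (the addresses are purely illustrative); hardware typically uses the equivalent power-of-two bit-mask form:

def is_aligned(address: int, size: int) -> bool:
    # An access of `size` bytes at byte address A is aligned if A mod size == 0.
    return address % size == 0

def is_aligned_pow2(address: int, size: int) -> bool:
    # For power-of-two sizes, a bit mask replaces the divide.
    assert size & (size - 1) == 0, "size must be a power of two"
    return address & (size - 1) == 0

print(is_aligned(0x1004, 4))   # True: 0x1004 mod 4 == 0
print(is_aligned(0x1006, 4))   # False: misaligned word access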
Contd…
3. Addressing modes
• MIPS addressing modes are Register,
Immediate (for constants), and Displacement,
where a constant offset is added to a register
to form the memory address.

• The 80x86 supports those three plus three variations of displacement: register indirect, indexed, and based with scaled index.
Contd…
4. Types and sizes of operands
• Like most ISAs, MIPS and 80x86 support
operand sizes of 8-bit (ASCII character), 16-bit
(half word), 32-bit (integer or word), 64-bit
(double word or long integer), and IEEE 754
floating point in 32-bit (single precision) and
64-bit (double precision).
• The 80x86 also supports 80-bit floating point
(extended double precision).
Contd…
5. Operations
• The general categories of operations are data transfer, arithmetic/logical, control, and floating point.
• MIPS is a simple and easy-to-pipeline instruction
set architecture, and it is representative of the
RISC architectures.
• The 80x86 has a much richer and larger set of
operations.
Contd…
6. Control flow instructions
• Virtually all ISAs, including 80x86 and MIPS,
support conditional branches, unconditional
jumps, procedure calls, and returns.
• Both use PC-relative addressing, where the
branch address is specified by an address field
that is added to the PC.
• MIPS conditional branches (BEQ, BNE, etc.) test the contents of registers, while the 80x86 branches (JE, JNE, etc.) test condition code bits.
Contd…
7. Encoding an ISA
• There are two basic choices on encoding: fixed
length and variable length .
• All MIPS instructions are 32 bits long, which
simplifies instruction decoding. Figure shows
the MIPS instruction formats.
• The 80x86 encoding is variable length, ranging
from 1 to 18 bytes. Variable length
instructions can take less space than fixed-
length instructions.
Contd…
[Figure: MIPS instruction formats.]
Flynn’s Classification
SISD (Single Instruction Single Data):
Uniprocessors.
MISD (Multiple Instruction Single Data):
No practical examples exist
SIMD (Single Instruction Multiple Data):
Specialized processors(Vector architectures,
Multimedia extensions, Graphics processor units)

MIMD (Multiple Instruction Multiple Data):
General purpose, commercially important
(Tightly-coupled MIMD, Loosely-coupled MIMD)
SISD
[Diagram: a single control unit sends one instruction stream (IS) to one processing unit, which exchanges one data stream (DS) with a memory module.]
SIMD
[Diagram: one control unit broadcasts a single instruction stream (IS) to processing units 1..n; each processing unit has its own data stream (DS1..DSn) to its own memory module.]
MIMD
[Diagram: control units 1..n each send their own instruction stream (IS) to their own processing unit; each processing unit exchanges its own data stream (DS1..DSn) with a memory module.]
Pipelining: Basic and Intermediate
Concepts
RISC Instruction Set Basics
(from Hennessy and Patterson)
• Properties of RISC architectures:
– All ops on data apply to data in registers and typically
change the entire register (32-bits or 64-bits).
– The only ops that affect memory are load/store
operations. Memory to register, and register to
memory.
– Load and store ops on data less than a full size of a
register (32, 16, 8 bits) are often available.
– Usually instructions are few in number (this can be
relative) and are typically one size.
RISC Instruction Set Basics
Types Of Instructions
• ALU Instructions:
• Arithmetic operations, either take two registers as
operands or take one register and a sign extended
immediate value as an operand. The result is stored in
a third register.
• Logical operations AND, OR, XOR do not usually differentiate between 32-bit and 64-bit.
• Load/Store Instructions:
• Usually take a register (the base register) and a 16-bit immediate value as operands. The sum of the two forms the effective address. A second register acts as the destination in the case of a load operation.
RISC Instruction Set Basics
Types Of Instructions (continued)
• In the case of a store operation the second register
contains the data to be stored.
• Branches and Jumps
• Conditional branches are transfers of control. As
described before, a branch causes an immediate value
to be added to the current program counter.
RISC Instruction Set Implementation
• We first need to look at how instructions in the
MIPS64 instruction set are implemented without
pipelining. Assume that any instruction (MIPS) can be
executed in at most 5 clock cycles.
• The five clock cycles will be broken up into the
following steps:
• Instruction Fetch Cycle
• Instruction Decode/Register Fetch Cycle
• Execution Cycle
• Memory Access Cycle
• Write-Back Cycle
Instruction cycle
Instruction Fetch (IF) Cycle
• Send the program counter (PC) to memory
and fetch the current instruction from
memory.
• Update the PC to the next sequential PC by
adding 4 (since each instruction is 4 bytes) to
the PC.
Instruction Decode (ID)/Register Fetch Cycle
• Decode the instruction and at the same time read the values of the registers involved. As the registers are being read, do an equality test in case the instruction decodes as a branch or jump.
• The offset field of the instruction is sign-extended in case it is needed. The possible branch effective address is computed by adding the sign-extended offset to the incremented PC. The branch can be completed at this stage if the equality test is true and the instruction decoded as a branch.
Instruction Decode (ID)/Register Fetch
Cycle (continued)

• Instructions can be decoded in parallel with reading the registers because the register addresses are at fixed locations.
Execution (EX)/Effective Address
Cycle
• If a branch or jump did not occur in the
previous cycle, the arithmetic logic unit (ALU)
can execute the instruction.
• At this point the instruction falls into three
different types:
• Memory Reference: ALU adds the base register and
the offset to form the effective address.
• Register-Register: ALU performs the arithmetic,
logical, etc… operation as per the opcode.
• Register-Immediate: ALU performs operation based on
the register and the immediate value (sign extended).
Memory Access (MEM) Cycle
• If a load, the effective address computed from
the previous cycle is referenced and the
memory is read. The actual data transfer to
the register does not occur until the next
cycle.
• If a store, the data from the register is written
to the effective address in memory.
Write-Back (WB) Cycle
• Occurs with Register-Register ALU instructions
or load instructions.
• Whether the operation is a register-register operation or a memory load, the resulting data is written to the appropriate register in the register file.
What Is A Pipeline?
• Pipelining is used by virtually all modern
microprocessors to enhance performance by
overlapping the execution of instructions.
• A common analogy for a pipeline is a factory assembly line. Assume that there are three stages:
• Welding
• Painting
• Polishing
• For simplicity, assume that each task takes one
hour.
What Is A Pipeline?
• If a single person were to work on the product it
would take three hours to produce one product.
• If we had three people, one person could work
on each stage, upon completing their stage they
could pass their product on to the next person
(since each stage takes one hour there will be
no waiting).
• We could then produce one product per hour
assuming the assembly line has been filled.
What Is A Pipeline?
Pipelining: is an implementation technique
whereby multiple instructions are overlapped
in execution.
• It takes advantage of parallelism that exists
among the actions needed to execute an
instruction.
• Pipelining is the key implementation
technique used to make fast CPUs.
Characteristics Of Pipelining
• If the stages are perfectly balanced, then the time per instruction on the pipelined processor (assuming ideal conditions) is equal to:
Time per instruction on unpipelined machine / Number of pipe stages
• Under these conditions, the speedup from pipelining equals the number of pipe stages.
Contd…
• Usually, however, the stages will not be perfectly balanced; furthermore, pipelining does involve some overhead.
• The previous expression is ideal. We will see later that there are many ways in which a pipeline cannot function in a perfectly balanced fashion.
Characteristics Of Pipelining
• In terms of a CPU, the implementation of
pipelining has the effect of reducing the
average instruction time, therefore reducing
the average CPI.

• EX: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4-stage pipeline, the ideal average CPI with the pipeline will be 5/4 = 1.25.
Serial Vs Pipeline
Pipelined Execution
[Diagram: six instructions each pass through IFetch, Dcd, Exec, Mem, WB; each instruction starts one cycle after the previous one, so the stages overlap as the program flows through time.]
Precedence relation
A set of subtasks {T1, T2, …, Tn} for a given task T, such that a subtask Tj cannot start until some earlier subtask Ti (i < j) finishes.
A pipeline consists of a cascade of processing stages.
Stages are combinational circuits operating on the data stream flowing through the pipe.
Stages are separated by high-speed interface latches (holding intermediate results between stages).
Control must be under a common clock.
Pipeline Cycle
Pipeline cycle:
Determined by the time required by the
slowest stage.
Pipeline designers try to balance the length
(i.e. the processing time) of each pipeline
stage.
For a perfectly balanced pipeline, the
execution time per instruction is t/n, where
t is the execution time per instruction on
nonpipelined machine and n is the number
of pipe stages.
Pipeline Cycle
However, it is very difficult to make the
different pipeline stages perfectly balanced.

Besides, pipelining itself involves some overhead.
Synchronous Pipeline
- Transfers between stages are simultaneous.
- One task or operation enters the pipeline per cycle.
[Diagram: stages S1, S2, …, Sk separated by latches L, all driven by a common clock; τm is the stage delay and d is the latch delay.]
Asynchronous Pipeline
- Transfers performed when individual stages are ready.
- Handshaking protocol between adjacent stages.
[Diagram: stages S1, S2, …, Sk connected in a chain by Ready/Ack handshake signals, from Input to Output.]
- Different amounts of delay may be experienced at different stages.
- Can display variable throughput rate.
A Few Pipeline Concepts
[Diagram: adjacent stages Si and Si+1, with stage delay τm and latch delay d.]
Pipeline cycle: τ
Latch delay: d
τ = max{τm} + d
Pipeline frequency: f = 1/τ
Example on Clock period
Suppose the time delays of the 4 stages are τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and the interface latch has a delay of d = 10 ns.
Hence the cycle time of this pipeline is τ = max{τm} + d = 90 + 10 = 100 ns.
Clock frequency of the pipeline: f = 1/100 ns = 10 MHz.
If it were non-pipelined, the time per instruction would be 60 + 50 + 90 + 80 = 280 ns.
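The same calculation as a small Python sketch (the stage and latch delays are the ones from the example above):

stage_delays_ns = [60, 50, 90, 80]    # tau_1 .. tau_4
latch_delay_ns = 10                   # d

cycle_time_ns = max(stage_delays_ns) + latch_delay_ns   # tau = max{tau_m} + d
frequency_mhz = 1000 / cycle_time_ns                    # f = 1/tau, in MHz
non_pipelined_ns = sum(stage_delays_ns)                 # all stages in series

print(cycle_time_ns)       # 100 (ns)
print(frequency_mhz)       # 10.0 (MHz)
print(non_pipelined_ns)    # 280 (ns)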
Ideal Pipeline Speedup
A k-stage pipeline processes n tasks in k + (n-1) clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks.
Total time to process n tasks: Tk = [k + (n-1)]τ
For the non-pipelined processor: T1 = n·k·τ
Pipeline Speedup Expression
Speedup:
Sk = T1 / Tk = n·k·τ / ([k + (n-1)]·τ) = n·k / (k + (n-1))
Maximum speedup: Sk → k, for n >> k
Observe that the memory bandwidth must increase by a factor of Sk; otherwise, the processor would stall waiting for data to arrive from memory.
Efficiency of pipeline
The percentage of busy time-space span over the total time-space span.
n: number of tasks or instructions
k: number of pipeline stages
τ: clock period of the pipeline
Hence pipeline efficiency can be defined by:
η = (n·k·τ) / (k·[k·τ + (n-1)·τ]) = n / (k + (n-1))
Throughput of pipeline
The number of tasks that can be completed by a pipeline per unit time.
W = n / ([k + (n-1)]·τ) = n·f / (k + (n-1)) = η·f
Ideal case: W = 1/τ = f when η = 1.
Maximum throughput = frequency of the linear pipeline.
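The speedup, efficiency, and throughput formulas above can be checked together; a minimal sketch where k, n, and the clock period tau are free parameters:

def pipeline_metrics(k: int, n: int, tau: float):
    # k-stage linear pipeline processing n tasks with clock period tau
    t_pipe = (k + (n - 1)) * tau      # T_k = [k + (n-1)] * tau
    t_serial = n * k * tau            # T_1 = n * k * tau
    speedup = t_serial / t_pipe       # S_k = nk / (k + n - 1)
    efficiency = n / (k + (n - 1))    # eta = n / (k + n - 1)
    throughput = n / t_pipe           # W = eta * f
    return speedup, efficiency, throughput

s, e, w = pipeline_metrics(k=5, n=1000, tau=10e-9)
print(round(s, 2), round(e, 3))   # 4.98 0.996: S_k -> k and eta -> 1 for n >> k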


Pipelines: A Few Basic Concepts
Historically, there are two different types of pipelines:
Instruction pipelines
Arithmetic pipelines
Arithmetic pipelines (e.g. FP multiplication) are not
popular in general purpose computers:
Need a continuous stream of arithmetic
operations.
E.g. Vector processors operating on an array.
On the other hand, instruction pipelines are used in almost every modern processor.
Pipelines: A Few Basic Concepts
Pipeline increases instruction throughput:
But, does not decrease the execution time of the
individual instructions.
In fact, slightly increases execution time of each
instruction due to pipeline overheads.
Pipeline overhead arises due to a combination
of:
Pipeline register delay
Clock skew
Pipelines: A Few Basic Concepts
Pipeline register delay:
Caused due to set up time
Clock skew:
the maximum delay between clock arrival at
any two registers.
Once the clock cycle is as small as the pipeline overhead:
No further pipelining would be useful.
Very deep pipelines may not be useful.
Pipeline Registers
Pipeline registers are essential part of pipelines:
There are 4 groups of pipeline registers in 5 stage pipeline.
Each group saves output from one stage and passes it as input
to the next stage:
IF/ID
ID/EX
EX/MEM
MEM/WB
This way, each time “something is computed”...
Effective address, Immediate value, Register content, etc.
It is saved safely in the context of the instruction that needs
it.
Looking At The Big Picture
• Overall, the most time that a non-pipelined instruction can take is 5 clock cycles. Below is a summary:
• Branch - 2 clock cycles
• Store - 4 clock cycles
• Other - 5 clock cycles
• EX: Assuming branch instructions account for
12% of all instructions and stores account for
10%, what is the average CPI of a non-
pipelined CPU?
ANS: 0.12*2+0.10*4+0.78*5 = 4.54
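The answer is just an instruction-mix weighted sum; a minimal sketch reproducing it:

# (fraction of instruction mix, clock cycles) for the non-pipelined CPU
mix = [(0.12, 2),   # branches
       (0.10, 4),   # stores
       (0.78, 5)]   # all other instructions

avg_cpi = sum(fraction * cycles for fraction, cycles in mix)
print(avg_cpi)   # 4.54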
Assignment
Find out total time to processes 100 tasks in
a 2-stage pipeline with a cycle time 10ns.
Repeat the above problem assuming latching
in pipeline require 2ns.
A pipeline has 4-stage with time delays 1 =
60ns, 2 = 50ns, 3 = 90ns, 4 = 80ns & the
interface latch has a delay of tld = 10ns. What
is the cycle time of this pipeline?
What is the clock frequency of the above
pipeline?
Instruction-Level Parallelism
• What is ILP (Instruction-Level Parallelism)?
– Parallel execution of different instructions
belonging to the same thread.
• A thread usually consists of several basic
blocks:
– As well as several branches and loops.
• Basic block:
– A sequence of instructions not having a branch
instruction.
Cont…
• Instruction pipelines can effectively exploit
parallelism in a basic block:
– An n-stage pipeline can improve performance up
to n times.
– Does not require much investment in hardware
– Transparent to the programmers.
• Pipelining can be viewed to:
– Decrease average CPI, and/or
– Decrease clock cycle time for instructions.
Drags on Pipeline Performance
• Factors that can degrade pipeline performance
– Unbalanced stages
– Pipeline overheads
– Clock skew
– Hazards
• Hazards cause the worst drag on the
performance of a pipeline.
The Classical RISC: 5 Stage Pipeline
• In an ideal case to implement a pipeline we
just need to start a new instruction at each
clock cycle.
• Unfortunately there are many problems while
trying to implement this.
• If we look at each stage of instruction execution as being independent, we can see how instructions can be "overlapped".
Problems With The Previous Figure
• The memory is accessed twice during each clock
cycle. This problem is avoided by using separate data
and instruction caches.
• It is important to note that if the clock period is the same for a pipelined processor and a non-pipelined processor, the memory must work five times faster.
• Another problem that we can observe is that the
registers are accessed twice every clock cycle. To try
to avoid a resource conflict we perform the register
write in the first half of the cycle and the read in the
second half of the cycle.
Problems With The Previous Figure
(continued)
• We write in the first half of the cycle so that a value written by one instruction can be read in the second half by another instruction further down the pipeline.

• A third problem arises with the interaction of the pipeline with the PC. We use an adder to increment the PC by the end of IF. Within ID we may branch and modify the PC. How does this affect the pipeline?
Pipeline Hazards
• The performance gain from using pipelining
occurs because we can start the execution of a
new instruction each clock cycle. In a real
implementation this is not always possible.
• What is a pipeline hazard?
– A situation that prevents an instruction from executing during its designated clock cycles.
• Pipeline hazards prevent the execution of the
next instruction during the appropriate clock
cycle.
Types Of Hazards
Structural hazards arise from resource conflicts
when the hardware cannot support all possible
combinations of instructions simultaneously in
overlapped execution.
Data hazards arise when an instruction depends on
the results of a previous instruction in a way that
is exposed by the overlapping of instructions in
the pipeline.
Control hazards arise from the pipelining of
branches and other instructions that change the
PC.
Structural Hazard: Example
[Diagram: four overlapped instructions, each passing through IF, ID, EXE, MEM, WB; in one cycle, two instructions need the memory at the same time.]
An Example of a Structural Hazard
[Diagram: a Load followed by instructions 1-4, each passing through Mem (instruction fetch), Reg, ALU, DM (data memory), Reg. With a single memory, the Load's data access overlaps instruction 3's instruction fetch in the same cycle. Would there be a hazard here?]
Performance with Stalls
• Stalls degrade performance of a pipeline:
– Result in deviation from 1 instruction executing/clock cycle.
– Let's examine by how much stalls can impact CPI…
A Hazard Will Cause A Pipeline Stall
• Some performance expressions involve a realistic pipeline in terms of CPI. It is assumed that the clock period is the same for pipelined and unpipelined implementations.
Speedup = CPI unpipelined / CPI pipelined
= Pipeline depth / (1 + Stalls per instruction)
= Avg instruction time unpipelined / Avg instruction time pipelined
Stalls and Performance
• CPI pipelined =
=Ideal CPI + Pipeline stall cycles per instruction
=1 + Pipeline stall cycles per instruction
• Ignoring overhead and assuming stages are balanced:
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
= Pipeline depth / (1 + Pipeline stall cycles per instruction)
Stalls and Performance (Contd…)
• Alternatively, pipelining can be viewed as improving the clock cycle time. Assuming the CPI of both the unpipelined and the pipelined machine is 1:
Speedup = (CPI unpipelined / CPI pipelined) × (Clock cycle time unpipelined / Clock cycle time pipelined)
= Clock cycle unpipelined / Clock cycle pipelined
= Pipeline depth
Alternate Speedup Expression
Pipeline depth = Clock cycle unpipelined / Clock cycle pipelined
Speedup from pipelining = [1 / (1 + Pipeline stall cycles per instruction)] × (Clock cycle time unpipelined / Clock cycle time pipelined)
= Pipeline depth / (1 + Pipeline stall cycles per instruction)
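A small sketch of this speedup expression (the depth and stall values are illustrative):

def pipeline_speedup(depth: int, stalls_per_instruction: float) -> float:
    # Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)
    return depth / (1 + stalls_per_instruction)

print(pipeline_speedup(5, 0.0))            # 5.0: ideal, no stalls
print(round(pipeline_speedup(5, 0.4), 2))  # 3.57: stalls eat into the ideal speedup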
Defining Performance
• To maximize performance, we want to
minimize response time or execution time for
some task. We can relate performance and
execution time for a computer X as
Performance_X = 1 / ExecutionTime_X
Performance_X > Performance_Y ⇔ ExecutionTime_Y > ExecutionTime_X
Contd…
Relative Performance:
Performance_X / Performance_Y = n
Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X = n
Contd…
If computer A runs a program in 10 seconds and
computer B runs the same program in 15
seconds, how much faster is A than B?
Performance_A / Performance_B = ExecutionTime_B / ExecutionTime_A = n
Thus, the performance ratio is 15/10 = 1.5,
and A is therefore 1.5 times faster than B.
CPU performance and its factors
CPU execution time for a program = CPU clock cycles for a program × clock cycle time
CPU execution time for a program = CPU clock cycles for a program / clock rate
Example
• A program runs in 10 seconds on computer A,
which has a 2 GHz clock. We wish to build a
computer B, which will run this program in 6
seconds. If it is possible to increase the clock
rate, then computer B requires 1.2 times as
many clock cycles as computer A for this
program. What is the clock rate we should
target?
Contd..
• Number of clock cycles (CPU clock cycles) for A:
CPU time_A = CPU clock cycles_A / Clock rate_A
10 seconds = CPU clock cycles_A / (2 × 10^9 cycles/sec)
CPU clock cycles_A = 10 seconds × 2 × 10^9 cycles/sec = 20 × 10^9 cycles
Contd..
• The number of clock cycles (CPU clock cycles) for B can be found as:
CPU time_B = 1.2 × CPU clock cycles_A / Clock rate_B
6 seconds = (1.2 × 20 × 10^9 cycles) / Clock rate_B
Clock rate_B = (1.2 × 20 × 10^9 cycles) / 6 seconds = 4 × 10^9 cycles/sec = 4 GHz
To run the program in 6 seconds, B must have twice the clock rate of A.
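The same arithmetic as a short sketch:

time_a = 10.0        # seconds on computer A
rate_a = 2e9         # 2 GHz
cycles_a = time_a * rate_a          # 20e9 cycles

cycles_b = 1.2 * cycles_a           # B needs 1.2x as many cycles
rate_b = cycles_b / 6.0             # clock rate = clock cycles / CPU time

print(cycles_a / 1e9)   # 20.0 (x 10^9 cycles)
print(rate_b / 1e9)     # 4.0 (GHz): twice A's clock rate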
Instruction Performance
• Average clock cycles per instruction is abbreviated as CPI.
CPU clock cycles for a program = Instructions for a program × Average clock cycles per instruction
• We have a computer A with 250 ps clock cycle time and a CPI of 2.0 for some program, and a computer B with 500 ps clock cycle time and a CPI of 1.2 for the same program. Which computer is faster, and by how much?
Contd…
• Ans: for a program with P instructions:
• CPU_A clock cycles = P × 2.0
• CPU_B clock cycles = P × 1.2
• CPU time for each computer:
• CPU_A time = P × 2.0 × 250 ps = 500·P ps
• CPU_B time = P × 1.2 × 500 ps = 600·P ps
CPU Performance_A / CPU Performance_B = ExecutionTime_B / ExecutionTime_A = 600·P ps / 500·P ps = 1.2
• A is 1.2 times as fast as B.
Dealing With Structural Hazards
• Arise from resource conflicts among instructions
executing concurrently:
– Same resource is required by two (or more)
concurrently executing instructions at the same
time.
• Easy way to avoid structural hazards:
– Duplicate resources (sometimes not practical)
– Memory interleaving ( lower & higher order )
Contd…
• Examples of resolution of structural hazards:
– An ALU to perform an arithmetic operation and a separate adder to increment the PC.
– Separate data cache and instruction cache accessed simultaneously in the same cycle.
How is it Resolved?
[Diagram: the same Load + instructions 1-3 sequence as before, but a stall (bubbles in every stage) is inserted before instruction 3, so its instruction fetch no longer conflicts with the Load's data access.]
A pipeline can be stalled by inserting a "bubble" or NOP.
Dealing With Structural Hazards
• A structural hazard is dealt with by inserting a
stall or pipeline bubble into the pipeline.
• This means that for that clock cycle, nothing
happens for that instruction.
• This effectively “slides” that instruction, and
subsequent instructions, by one clock cycle.
• This effectively increases the average CPI.
Example
• Assume that you need to compare two
processors, one with a structural hazard that
occurs 40% for the time, causing a stall.
Assume that the processor with the hazard
has a clock rate 1.05 times faster than the
processor without the hazard. How fast is the
processor with the hazard compared to the
one without the hazard?
Contd…
Note: both are pipelined processors.
Speedup = (CPI no-hazard / CPI hazard) × (Clock cycle time no-hazard / Clock cycle time hazard)
= [1 / (1 + 0.4 × 1)] × [1 / (1/1.05)]
= 1.05 / 1.4 = 0.75
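The same comparison as a sketch:

# Both processors are pipelined with ideal CPI 1. The one with the structural
# hazard stalls one cycle on 40% of instructions but has a 1.05x faster clock.
cpi_no_haz, cycle_no_haz = 1.0, 1.0
cpi_haz = 1.0 + 0.4 * 1.0
cycle_haz = 1.0 / 1.05

speedup = (cpi_no_haz / cpi_haz) * (cycle_no_haz / cycle_haz)
print(round(speedup, 2))   # 0.75: the faster-clocked machine is actually slower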
Dealing With Structural Hazards
(continued)
• We can see that even though the clock speed
of the processor with the hazard is a little
faster, the speedup is still less than 1.
• Therefore the hazard has quite an effect on
the performance.
• Sometimes computer architects will opt to
design a processor that exhibits a structural
hazard. Why?
• A: The improvement to the processor data path is too costly.
• B: The hazard occurs rarely enough so that the processor will still
perform to specifications.
An Example of Performance
Impact of Structural Hazard
• Assume:
– Pipelined processor.
– Data references constitute 40% of an instruction
mix.
– Ideal CPI of the pipelined machine is 1.
– Consider two cases:
• Unified data and instruction cache vs. separate data and
instruction cache.
• What is the impact on performance?
An Example (Cont…)
• Avg. Inst. Time = CPI × Clock Cycle Time
(i) For separate caches: Avg. Inst. Time = 1 × 1 = 1
(ii) For the unified cache case:
= (1 + 0.4 × 1) × (Clock cycle time_ideal)
= 1.4 × Clock cycle time_ideal = 1.4
• Speedup = 1/1.4 ≈ 0.7
• ~30% degradation in performance
Data Dependences and Hazards
• Determining how one instruction depends on
another is critical to determining how much
parallelism exists in a program and how that
parallelism can be exploited.
Data Dependences
There are three different types of dependences:
• Data Dependences (also called true data
dependences), Name Dependences and
Control Dependences.
• An instruction j is data dependent on
instruction i if either of the following holds:
– Instruction i produces a result that may be used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
Consider the MIPS code sequence
That increments a vector of values in memory (starting at 0(R1), with the last element at 8(R2)) by a scalar in register F2:
Loop: L.D F0, 0(R1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer 8 bytes
BNE R1, R2, LOOP ;branch R1!=R2
Data Dependences
Contd…
• A data value may flow between instructions
either through registers or through memory
locations.
• When the data flow occurs in a register,
detecting the dependence is straight forward
since the register names are fixed in the
instructions
• Although it gets more complicated when
branches intervene
Contd…
• Dependences that flow through memory
locations are more difficult to detect
• Since two addresses may refer to the same
location but look different:
• For example, 100(R4) and 20(R6) may be
identical memory addresses.
• The effective address of a load or store may change from one execution of the instruction to another (so that 20(R4) and 20(R4) may be different).
Detecting Data Dependences
• A data value may flow between instructions:
– (i) through registers
– (ii) through memory locations.
• When data flow is through a register:
– Detection is rather straight forward.
• When data flow is through a memory location:
– Detection is difficult.
– Two addresses may refer to the same memory
location but look different.
100(R4) and 20(R6)
Name Dependences
• A Name Dependence occurs when two
instructions use the same register or memory
location, called a name
• There are two types of name dependences between an instruction i that precedes instruction j in program order:
• Antidependence
• Output dependence
Contd…
• An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads.
• The original ordering must be preserved to ensure that i reads the correct value. There is an antidependence between S.D and DADDUI on register R1 in the MIPS code sequence on the next slide.
Consider the MIPS code sequence
That increments a vector of values in memory (starting at 0(R1), with the last element at 8(R2)) by a scalar in register F2:
Loop: L.D F0, 0(R1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer 8 bytes
BNE R1, R2, LOOP ;branch R1!=R2
Contd…
• An Output Dependence occurs when
instruction i and instruction j write the same
register or memory location.

• The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.
Data Hazards
• Occur when an instruction under execution depends on:
– Data from an instruction ahead in the pipeline.
• Example:
A = B + C;
D = A + E;
[Diagram: the two instructions overlap in the pipeline; D = A + E enters EXE before A = B + C has written A back in WB.]
– The dependent instruction uses old data:
• Results in wrong computations
Types of Data Hazards
• Data hazards are of three types:
– Read After Write (RAW)
– Write After Read (WAR)
– Write After Write (WAW)
• With an in-order execution machine:
– WAW, WAR hazards can not occur.
• Assume instruction i is issued before j.
Read after Write (RAW) Hazards
• Hazard between two instructions i & j may
occur when j attempts to read some data
object that has been modified by i.
– instruction j tries to read its operand before
instruction i writes it.
– j would incorrectly receive an old or incorrect
value.
• Example:
i: ADD R1, R2, R3 (instruction i, a write to R1, issued before j)
j: SUB R4, R1, R6 (instruction j, a read of R1, issued after i)
Read after Write (RAW) Hazards
[Diagram: instruction I writes its range R(I); a later instruction J reads its domain D(J).]
Condition: R(I) ∩ D(J) ≠ Ø for RAW
RAW Dependency: More Examples
• Example program (a):
–i1: load r1, addr;
–i2: add r2, r1,r1;
• Program (b):
–i1: mul r1, r4, r5;
–i2: add r2, r1, r1;
• Both cases, i2 does not get operand until i1
has completed writing the result
–In (a) this is due to load-use dependency
–In (b) this is due to define-use dependency
Write after Read (WAR) Hazards
– Instruction j tries to write its destination operand before instruction i reads it.
– i would incorrectly receive a new or incorrect value.
• WAR hazards do not usually occur because of the amount of time between the read cycle and write cycle in a pipeline.
• WAR hazards occur due to antidependency. Example:
i: ADD R1, R2, R3 (instruction i, a read of R2, issued before j)
j: SUB R2, R4, R6 (instruction j, a write to R2, issued after i)
Write after Read (WAR) Hazards
[Diagram: instruction J writes its range R(J); an earlier instruction I reads its domain D(I).]
Condition: D(I) ∩ R(J) ≠ Ø for WAR
Write After Write (WAW) Hazards
• WAW hazard:
– Both i & j want to modify the same data object.
– Instruction j tries to write an operand before instruction i writes it.
– Writes are performed in the wrong order.
• Example:
i: DIV F1, F2, F3 (instruction i, a write to F1, issued before j)
j: SUB F1, F4, F6 (instruction j, a write to F1, issued after i)
(How can this happen???)
• WAW hazards occur due to output dependence.
Write After Write (WAW) Hazards
[Diagram: instruction I writes its range R(I); a later instruction J writes its range R(J).]
Condition: R(I) ∩ R(J) ≠ Ø for WAW
Inter-Instruction Dependences
• Data dependence: Read-after-Write (RAW)
r3 ← r1 op r2
r5 ← r3 op r4
• Anti-dependence: Write-after-Read (WAR), a false dependency
r3 ← r1 op r2
r1 ← r4 op r5
• Output dependence: Write-after-Write (WAW), a false dependency
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
• Control dependence
Data Dependencies: Summary
Data dependencies in straight-line code:
• RAW (Read After Write): flow dependency; includes load-use and define-use dependency; a true dependency that cannot be overcome.
• WAR (Write After Read): anti dependency; a false dependency.
• WAW (Write After Write): output dependency; a false dependency.
False dependencies can be eliminated by register renaming.
Recollect Data Hazards
What causes them?
– Pipelining changes the order of read/write
accesses to operands.
– Order differs from that of an unpipelined
machine.
• Example:
– ADD R1, R2, R3
– SUB R4, R1, R5
For MIPS, ADD writes the register in WB but SUB needs it in ID. This is a data hazard.
Illustration of a Data Hazard
[Diagram: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11, all overlapped in the pipeline.]
The ADD instruction causes a hazard in the next 3 instructions because the register is not written until after those 3 read it.
Solutions to Data Hazard
• Operand forwarding
• Pipeline interlock
• By S/W (NOP)
• Reordering the instruction
Forwarding
• Simplest solution to data hazard:
– forwarding
• Result of the ADD instruction not really
needed:
– until after ADD actually produces it.
• Can we move the result from EX/MEM
register to the beginning of ALU (where SUB
needs it)?
– Yes!
Forwarding
cont…
• Generally speaking:
– Forwarding occurs when a result is
passed directly to the functional unit
that requires it.
– Result goes from output of one pipeline
stage to input of another.
Forwarding Technique
[Diagram: the ALU result latched at the end of the EXECUTE stage is fed back through a forwarding path to the ALU input, bypassing the WRITE stage.]
When Can We Forward?
[Diagram: ADD R1, R2, R3 followed by SUB, AND, OR, XOR, all reading R1. SUB gets R1 from the EX/MEM pipe register; AND gets it from the MEM/WB pipe register; OR gets it by forwarding within the register file (write in the first half cycle, read in the second half); XOR reads the register file normally.]
If the forwarding line goes "forward" in time, you can do forwarding. If it is drawn backward, it is physically impossible.
General Data Forwarding

• It is easy to see how data forwarding can be used by drawing out the pipelined execution of each instruction.
• Now consider the following instructions:
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
Problems
• Can data forwarding prevent all data hazards?
• NO!
• The following operations will still cause a data hazard. This happens because the further down the pipeline we get, the less we can use forwarding.
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Problems
• We can avoid the hazard by using a Pipeline
interlock.
• The pipeline interlock will detect when data
forwarding will not be able to get the data to
the next instruction in time.
• A stall is introduced until the instruction can
get the appropriate data from the previous
instruction.
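A minimal sketch of what the interlock checks: if a load's destination register is a source of the immediately following instruction, forwarding cannot deliver the data in time and one bubble must be inserted. The tuple encoding of instructions here is a simplification, not real MIPS:

# Each instruction: (opcode, destination register, source registers)
program = [
    ("LD",   "R1", ["R2"]),         # LD   R1, 0(R2)
    ("DSUB", "R4", ["R1", "R5"]),   # needs R1 one cycle too early: stall
    ("AND",  "R6", ["R1", "R7"]),   # R1 is forwardable by now, no stall
    ("OR",   "R8", ["R1", "R9"]),
]

stalls = 0
for prev, curr in zip(program, program[1:]):
    prev_op, prev_dest, _ = prev
    _, _, curr_srcs = curr
    # Load-use hazard: the load result is consumed by the very next instruction
    if prev_op == "LD" and prev_dest in curr_srcs:
        stalls += 1

print(stalls)   # 1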
Handling data hazard by S/W
• The compiler introduces a NOP between two dependent instructions.
• NOP = a piece of code that keeps a gap between two instructions.
• Detection of the dependency is left entirely to the S/W.
• Advantage: it enables the easy technique called instruction reordering.
Instruction Reordering
• ADD R1 , R2 , R3
• SUB R4 , R1 , R5 Before
• XOR R8 , R6 , R7
• AND R9 , R10 , R11

• ADD R1 , R2 , R3
• XOR R8 , R6 , R7 After
• AND R9 , R10 , R11
• SUB R4 , R1 , R5
Instruction Execution:
MIPS Data path
• Can break down the process of “running” an
instruction into stages.
• These stages are what needs to be done to
complete the execution of each instruction.
Some instructions will not require some stages.
MIPS Data path

The DLX (MIPS) datapath allows every instruction to be executed in 4 or 5 cycles.
Instruction Execution
Contd…
1. Instruction Fetch (IF) - Get the instruction to be
executed.

IR  M[PC]
NPC  PC + 4
IR – Instruction register
NPC – Next program counter
Instruction Execution
Contd…
2. Instruction Decode/Register Fetch (ID) –
Figure out what the instruction is supposed to
do and what it needs.

A  Register File[Rs]
B  Register File[Rt]
Imm  {(IR16)16, IR15..0}

A & B & Imm are temporary registers that hold


inputs to the ALU which is in the Execute Stage
Instruction Execution
Contd…
3. Execution (EX) -The instruction has been decoded, so
execution can be split according to instruction type.

Reg-Reg ALU instr: ALUout ← A op B
Reg-Imm: ALUout ← A op Imm
Branch: ALUout ← NPC + Imm; Cond ← (A {==, !=} 0)
LD/ST: ALUout ← A op Imm, to form the effective address
Instruction Execution
Contd…
4. Memory Access/Branch Completion (MEM) – Besides
the IF stage this is the only stage that access the
memory to load and store data.

Load: LMD ← Mem[ALUout]
Store: Mem[ALUout] ← B
Branch: if (cond) PC ← ALUout; else PC ← NPC
Jump: PC ← ALUout

LMD = Load Memory Data register
Instruction Execution
Contd…
5. Write-Back (WB) – Store all the results and loads
back to registers.

Reg-Reg ALU instr: Rd ← ALUoutput
Load: Rd ← LMD
Reg-Imm: Rt ← ALUoutput
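The register-transfer steps above can be strung together into a tiny interpreter; a sketch for a single register-register instruction only, with a toy pre-decoded instruction format (not real MIPS encodings):

regs = {f"R{i}": 0 for i in range(8)}
regs["R2"], regs["R3"] = 7, 5
memory = {0: ("ADD", "R1", "R2", "R3")}   # one pre-decoded instruction at PC=0
pc = 0

ir = memory[pc]                  # IF: IR <- M[PC]
npc = pc + 4                     #     NPC <- PC + 4

op, rd, rs, rt = ir              # ID: decode
a, b = regs[rs], regs[rt]        #     A <- Regs[Rs], B <- Regs[Rt]

alu_out = a + b if op == "ADD" else None   # EX: ALUout <- A op B

pc = npc                         # MEM: no memory access here; PC <- NPC

regs[rd] = alu_out               # WB: Rd <- ALUoutput
print(regs["R1"], pc)            # 12 4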
Amdahl’s Law
Quantifies overall performance gain due to improve in a
part of a computation. (CPU Bound)
Amdahl’s Law:
Performance improvement gained from using some faster
mode of execution is limited by the amount of time the
enhancement is actually used
Speedup = Performance for entire task using enhancement when possible / Performance for entire task without using enhancement
OR
Speedup = Execution time for a task without enhancement / Execution time for the task using enhancement
Amdahl’s Law and Speedup
Speedup tells us:
How much faster a machine will run due to an
enhancement.
For using Amdahl’s law two things should be
considered:
1st… FRACTION ENHANCED :-Fraction of the
computation time in the original machine that can
use the enhancement
It is always less than or equal to 1
If a program executes in 30 seconds and 15 seconds of
exec. uses enhancement, fraction = ½
Amdahl’s Law and Speedup
 2nd… SPEEDUP ENHANCED: the improvement gained by the enhancement, that is, how much faster the task would run if the enhanced mode were used for the entire program.
 It means the time of the original mode over the time of the enhanced mode.
 It is always greater than 1.
If the enhanced task takes 3.5 seconds and the original task took 7 seconds, we say the speedup is 2.
Amdahl's Law Equations
Execution time_new = Execution time_old × [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Speedup_overall = Execution time_old / Execution time_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Use the first equation, solve for speedup.
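A minimal sketch of the overall-speedup equation:

def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    # Speedup_overall = 1 / [(1 - F) + F / S]
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(amdahl_speedup(0.5, 2.0), 3))   # 1.333: half the time, made 2x faster
print(round(amdahl_speedup(0.5, 1e9), 3))   # 2.0: limited by the unenhanced half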
Assignments
Q. Consider a 5 stage pipeline system. If there
exist 40% of branch instruction, penalty for
branch instruction is 3 clock cycle & clock cycle
time is 20ns.Then, find out the throughput
with the pipeline system.
Q. A 5 stage pipeline separated by a 10ns clock.
If the non-pipeline clock is also having same
duration and the pipeline efficiency is 90%
then calculate the speed up factor.
Assignments
Q. A five stage pipeline processor has IF, ID, EXE,
MEM, WB. The IF, ID, MEM, WB stages takes 1
clock cycles each for any instruction. But at
the same time EXE stage takes 2 clock cycle for
ADD & SUB instructions, and 3 clock cycles for
MUL & DIV instructions respectively. Other
remaining instructions will take 1 clock cycle
to complete their EXE phase.
Assignments
Q. Consider the following instructions:-
MUL R1, R2, R4 For the above sequence:-
Find out all type of
STORE R1, 4(R3) data dependency?
DIV R4, R1, R2 Find out the difference
between total number of
ADD R7, R1, R4 clock cycles required to
MUL R9, R7, R4 complete the execution
of above given
LOAD R1, 8(R6) instruction set using
SUB R1, R9, R7 operand forwarding and
without using operand
forwarding?
Assignments
Q. Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual Pentium, but the resource requirements are not equal. The first application needs 80% of the resources, and the other only 20% of the resources.
Assignments
(i) Given that 40% of the first application is
parallelizable, how much speed up would you
achieve with that application if run in
isolation?
(ii) Given that 99% of the second application is
parallelizable, how much overall system speed
up would you get?
Control Hazards
• Result from branch and other instructions that change
the flow of a program (i.e. change PC).
• Example:
1: if (cond) {
2:   s1 }
3: s2
• The statement in line 2 is control dependent on the statement at line 1.
• Until condition evaluation completes:
– It is not known whether s1 or s2 will execute next.
Control Hazards
• Control hazards are caused by branches in the
code.
• During the IF stage remember that the PC is
incremented by 4 in preparation for the next IF
cycle of the next instruction.
• What happens if there is a branch performed
and we aren’t simply incrementing the PC by 4.
• The easiest way to deal with the occurrence of
a branch is to perform the IF stage again once
the branch occurs.
Four Simple Control/Branch Hazard
Solutions
• The following solutions assume that we are dealing with static branches (compile time), meaning that the actions taken during a branch do not change.
#1. Flush Pipeline/ Stall
#2. Predict Branch Not Taken:
#3. Predict Branch Taken
#4. Delayed branch.
Branch Hazard Solutions
#1. Flush Pipeline/Stall
• Stall until the branch direction is clear, flushing the pipe once an instruction is detected to be a branch during the ID stage.
• As an example, we will stall the pipeline until the branch is resolved (in that case we repeat the IF stage until the branch is resolved and modifies the PC).
Performing IF Twice
• We take a big performance hit by performing
the instruction fetch whenever a branch
occurs. Note, this happens even if the branch
is taken or not.
• This guarantees that the PC will get the
correct value.
[Diagram: instruction before the branch: IF ID EX MEM WB; branch: IF ID EX MEM WB; next instruction: IF IF ID EX MEM WB (IF repeated).]
Control Hazards solutions
#2. Predict Branch Not Taken:
• What if we treat every branch as “not taken”
remember that not only do we read the registers
during ID, but we also perform an equality test in
case we need to branch or not.
• We can improve performance by assuming that
the branch will not be taken.

– Execute successor instructions in sequence as if there is no branch
– Undo instructions in the pipeline if the branch is actually taken
Control Hazards solutions:
Predict Branch Not Taken: Contd..

• The "branch-not-taken" scheme is the same as performing the IF stage a second time in our 5-stage pipeline if the branch is taken.
• If the branch is not taken, there is no performance degradation.
• 47% of branches are not taken on average.
Control Hazards solutions
#3 Predict Branch Taken
– The "branch taken" scheme is of no benefit in our case, because we evaluate the branch target address in the ID stage.
– 53% of branches are taken on average.
– But the branch target address is not available after IF in MIPS.
• MIPS still incurs a 1-cycle branch penalty even with predict-taken.
• In loops, or on machines where the branch target is known before the branch outcome is computed, significant benefits can be accrued.
Control Hazards solutions
#4: Delayed Branch
• The fourth method for dealing with a control hazard
is to implement a “delayed” branch scheme.
• In this scheme an instruction is inserted into the
pipeline that is useful and not dependent on
whether the branch is taken or not. It is the job of
the compiler to determine the delayed branch
instruction.
• If the branch is actually taken, we need to clear the
pipeline of any code loaded in from the “not-taken”
path.
Control Hazards solutions
Delayed Branch Contd…

• Likewise we can assume that the branch is always taken. Does this work in our 5-stage pipeline?
No, the branch target is computed during the ID cycle.
• Some processors will have the target address computed in time for the IF stage of the next instruction, so there is no delay.
Control Hazards solutions
cont…
#4: Delayed Branch
– Insert an unrelated successor in the branch delay slot:
branch instruction
sequential successor1
sequential successor2
........   (branch delay of length n)
sequential successorn
branch target (if taken)
– 1 delay slot is required in the 5-stage pipeline.
The behavior of a delayed branch:
Delayed Branch
• Simple idea: put an instruction that would be executed anyway right after a branch.
[Diagram: branch: IF ID EX MEM WB; delay-slot instruction: IF ID EX MEM WB (one cycle later); branch target or successor: IF ID EX MEM WB (one more cycle later).]
• Question: what instruction do we put in the delay slot?
• Answer: one that can safely be executed no matter what the branch does.
– The compiler decides this.
Delayed Branch
• One possibility: an instruction from before the branch.
• Example, before scheduling:
DADD R1, R2, R3
if R2 == 0 then
(empty delay slot)
...
After scheduling, DADD moves into the slot:
if R2 == 0 then
DADD R1, R2, R3 (delay slot)
...
• The DADD instruction is executed no matter what happens in the branch:
– Because it was executed before the branch anyway!
– Therefore, it can be moved.
Delayed Branch
• We get to execute the DADD "for free":
[Diagram: branch: IF ID EX MEM WB; DADD (delay-slot instruction): IF ID EX MEM WB; branch target/successor: IF ID EX MEM WB. By the time the delay-slot instruction completes, we know whether or not to take the branch.]
Delayed Branch
• Another possibility: an instruction from the branch target.
• Example:
DSUB R4, R5, R6
...
DADD R1, R2, R3
if R1 == 0 then
(delay slot)
• The DSUB instruction can be replicated into the delay slot, and the branch target can be changed.
Delayed Branch
• The same example after scheduling:
DSUB R4, R5, R6
...
DADD R1, R2, R3
if R1 == 0 then
DSUB R4, R5, R6 (delay slot)
• The DSUB instruction has been replicated into the delay slot, and the branch target is changed to the instruction after the original DSUB.
Delayed Branch
• Yet another possibility: an instruction from the fall-through path.
• Example:
DADD R1, R2, R3
if R1 == 0 then
(delay slot)
OR R7, R8, R9
DSUB R4, R5, R6
• The OR instruction can be moved into the delay slot ONLY IF its execution doesn't disrupt the program execution (e.g., R7 is overwritten later).
Delayed Branch
• The same example after scheduling:
DADD R1, R2, R3
if R1 == 0 then
OR R7, R8, R9 (delay slot)
DSUB R4, R5, R6
• The OR instruction can be moved into the delay slot ONLY IF its execution doesn't disrupt the program execution (e.g., R7 is overwritten later).
Performance of branch with Stalls

• Stalls degrade performance of a pipeline:
– Result in deviation from 1 instruction executing/clock cycle.
– Let's examine by how much stalls can impact CPI…
Stalls and Performance with
branch
• CPI pipelined =
=Ideal CPI + Pipeline stall cycles per instruction
=1 + Pipeline stall cycles per instruction
Performance of branch instn
• Pipeline speedup = Pipeline depth / (1 + pipeline stall cycles from branches)
• Pipeline stall cycles from branches = Branch frequency × Branch penalty
• Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)
An Example of Impact of Branch
Penalty
• Assume for a MIPS pipeline:
– 16% of all instructions are branches:
• 4% unconditional branches: 3 cycle
penalty
• 12% conditional: 50% taken: 3 cycle
penalty
Impact of Branch Penalty
• For a sequence of N instructions:
– N cycles to initiate each
– 3 * 0.04 * N delays due to unconditional
branches
– 0.5 * 3 * 0.12 * N delays due to conditional
taken
• Overall cycles = 1.3·N
– (or CPI = 1.3 cycles/instruction)
– 30% performance hit!!!
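The CPI above comes straight from the stall-cycle formula; a minimal sketch:

penalty = 3          # branch penalty in cycles
f_uncond = 0.04      # unconditional branches: always pay the penalty
f_cond = 0.12        # conditional branches
taken = 0.5          # fraction of conditionals that are taken

stall_cycles = penalty * f_uncond + penalty * taken * f_cond
cpi = 1 + stall_cycles
print(cpi)           # 1.3, i.e. a 30% performance hit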
Reducing Branch Penalty
• Two approaches:
1) Move condition comparator to ID stage:
• Decide branch outcome and target address
in the ID stage itself:
• Reduces branch delay to 2 cycles.
2) Branch prediction
Basic Pipeline Scheduling and Loop
Unrolling: Running Example
• This code adds a scalar to a vector:
for (i=1000; i>0; i=i–1)
x[i] = x[i] + s;
• Assume the following latencies for all examples:
Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op | Another FP ALU op | 3
FP ALU op | Store double | 2
Load double | FP ALU op | 1
Load double | Store double | 0
Integer op | Integer op | 0
Integer load | (any use) | 1
An Example
Loop: LD F0, 0(R1) ; load the vector element
ADDD F4, F0, F2 ; add the scalar in F2
SD 0(R1), F4 ; store the vector element
DADDUI R1, R1, #-8 ; decrement the pointer by 8 bytes (per DW)
BNE R1, R2, Loop ; branch when R1 != R2

Instruction Producer | Instruction Consumer | Latency
FP ALU op | FP ALU op | 3
FP ALU op | Store Double | 2
Load Double | FP ALU op | 1
Load Double | Store Double | 0

Assume that latency for integer ops is zero and latency for integer load is 1.
An Example
Loop: LD F0, 0(R1) 1
STALL 2
ADDD F4, F0, F2 3
STALL 4
STALL 5
SD 0(R1), F4 6
DADDUI R1, R1, #-8 7
STALL 8
BNE R1, R2, Loop 9

This requires 9 cycles per iteration.
An Example: Scheduling
Loop: LD F0, 0(R1) 1
DADDUI R1, R1, #-8 2
ADDD F4, F0, F2 3
STALL 4
STALL 5
SD 8(R1), F4 6
BNE R1, R2, Loop 7

This requires 7 cycles per iteration.
An Example Unrolling

1 Loop: L.D F0,0(R1)


2 ADD.D F4,F0,F2
3 S.D 0(R1),F4 ;drop DSUBUI & BNE
4 L.D F6,-8(R1)
5 ADD.D F8,F6,F2
6 S.D -8(R1),F8 ;drop DSUBUI & BNE
7 L.D F10,-16(R1)
8 ADD.D F12,F10,F2
9 S.D -16(R1),F12 ;drop DSUBUI & BNE
10 L.D F14,-24(R1)
11 ADD.D F16,F14,F2
12 S.D -24(R1),F16
13 DADDUI R1,R1,#-32
14 BNE R1, R2,LOOP

With 1 stall after each L.D, 2 after each ADD.D, and 1 after the DADDUI, the 14 instructions take 27 cycles. This requires 6.75 cycles per iteration.
Unrolling + Scheduling: An Example
Loop: LD F0, 0(R1) 1
LD F6, -8(R1) 2
LD F10, -16(R1) 3
LD F14, -24(R1) 4
ADDD F4, F0, F2 5
ADDD F8, F6, F2 6
ADDD F12, F10, F2 7
ADDD F16, F14, F2 8
SD 0(R1), F4 9
SD -8(R1), F8 10
DADDUI R1, R1, #-32 11
SD 16(R1), F12 12
SD 8(R1), F16 13
BNE R1, R2, LOOP 14
This requires 3.5 cycles per iteration.
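A small summary computation of the four versions of the loop (the cycle counts are the ones from the slides above):

versions = {
    "unscheduled":           (9, 1),    # 9 cycles, 1 element per iteration
    "scheduled":             (7, 1),    # 7 cycles, 1 element
    "unrolled x4":           (27, 4),   # 14 instructions + 13 stall cycles
    "unrolled x4 + sched":   (14, 4),   # no stalls left
}
for name, (cycles, elements) in versions.items():
    print(f"{name:20s} {cycles / elements:5.2f} cycles/element")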
2-bit Prediction Scheme
• This method is more reliable than using a
single bit to represent whether the branch
was recently taken or not.
• The use of a 2-bit predictor will allow
branches that favor taken (or not taken) to be
mispredicted less often than the one-bit case.
Dynamic Branch Prediction
• Solution: a 2-bit scheme where we change the prediction only if we mispredict twice.
• Red: stop, not taken. Green: go, taken.
[State diagram: Predicted Taken (11), Predicted Taken (10), Predicted Not Taken (01), Predicted Not Taken (00). A taken branch moves the state toward 11; a not-taken branch moves it toward 00. The prediction is "taken" in states 11 and 10, and "not taken" in states 01 and 00.]
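A 2-bit saturating counter is easy to simulate; a minimal sketch (states 0-3, where 2 and 3 predict taken), with a made-up loop-branch history:

def predict_sequence(outcomes, state=0):
    # Returns how many branches the 2-bit counter predicts correctly.
    correct = 0
    for taken in outcomes:
        prediction = state >= 2          # states 2,3: predict taken
        correct += (prediction == taken)
        # move toward 3 on taken, toward 0 on not taken (saturating)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# A loop branch: taken 9 times, then falls through once, three times over.
history = ([True] * 9 + [False]) * 3
print(predict_sequence(history), "of", len(history))
# 25 of 30: two warm-up misses, then only one miss per loop exit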
Branch Predictors
• Increasing the size of a branch predictor memory improves its effectiveness only so much.
• We also need to address the effectiveness of
the scheme used. Just increasing the number
of bits in the predictor doesn’t do very much
either.
• Some other predictors include:
– Correlating Predictors
– Tournament Predictors
Branch Predictors
• Correlating predictors will use the history of a
local branch AND some overall information on
how branches are executing to make a
decision whether to execute or not.
• Tournament Predictors are even more
sophisticated in that they will use multiple
predictors local and global and enable them
with a selector to improve accuracy.
Why Dynamic Scheduling?
• All the static (compiler) techniques discussed so far use in-order instruction issue.
• That means that if an instruction is stalled in the
pipeline, no later instructions can proceed.
• With in-order issue, if two instructions have a
hazard between them, the pipeline will stall,
even if there are later instructions that are
independent and would not stall.
Why Dynamic Scheduling?
• Several early processors used another
approach, called dynamic scheduling, whereby
the hardware rearranges the instruction
execution to reduce the stalls.
Advantages of Dynamic Scheduling
• Handles cases where dependences are unknown at compile time (e.g. memory references)
• Simplifies the compiler: code compiled for one pipeline runs efficiently on a different pipeline
• Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling
• Key idea: instructions behind a stall can proceed
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
• Out-of-order execution => out-of-order completion.
Instruction Parallelism by HW
• Enables out-of-order execution and allows out-
of-order completion
• Will distinguish when an instruction begins execution and when it completes execution; in between, the instruction is in execution
• Dynamically scheduled pipeline: all instructions pass through the issue stage in order (in-order issue)
Dynamic Scheduling by Scoreboard:
bookkeeping technique - OLD
• To implement out-of-order execution, the ID stage must be split into two stages:
1. Issue—decode instructions, check for structural hazards
2. Read operands—wait until no data hazards, then read
operands
• Scoreboards date to CDC6600 in 1963
• Instructions execute whenever not dependent on
previous instructions and no hazards.
• CDC 6600: in-order issue, out-of-order execution (when there are no conflicts and the hardware is available), out-of-order commit (or completion)
– No forwarding
Scoreboard Architecture (CDC 6600)
[Diagram: the functional units (two FP multipliers, an FP divide unit, an FP add unit, and an integer unit) sit between the FP registers and memory; the SCOREBOARD controls all of them.]
Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
– Stall writeback until registers have been read
– Read registers only during Read Operands stage
• Solution for WAW:
– Detect hazard and stall issue of new instruction until other
instruction completes
• No register renaming
Scoreboard Implications
Contd…
• Need to have multiple instructions in
execution phase => multiple execution units or
pipelined execution units
• Scoreboard keeps track of dependencies
between instructions that have already issued.
• Scoreboard replaces ID, EX, WB with 4 stages
Four Stages of Scoreboard Control
• Issue—decode instructions & check for
structural hazards (ID1)
– Instructions issued in program order (for hazard checking)
– Don’t issue if structural hazard
– Don’t issue if instruction is output dependent on
previously issued but uncompleted instruction (no WAW
hazards)
• Read operands—wait until no data hazards,
then read operands (ID2)
– All real dependencies (RAW hazards) resolved in this
stage. Wait for instructions to write back data.
– No data forwarding
Four Stages of Scoreboard Control
• Execution—operate on operands (EX)
– Functional unit begins execution upon receiving operands.
When result is ready, scoreboard notified execute
complete
• Write result—finish execution (WB)
– Stall until no WAR hazards with previous instructions:

Example:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F8, F8, F14
The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands.
Dynamic Scheduling Using A Scoreboard

Three Parts of the Scoreboard


1. Instruction status—Indicates which of 4 steps the instruction is in
2. Functional unit status—Indicates the state of the functional
unit (FU). 9 fields for each functional unit
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready and not yet
read. Set to No after operands are read
3. Register result status—Indicates which functional unit will write
each register, if one exists. Blank when no pending instructions
will write that register
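To make this bookkeeping concrete, here is a minimal Python sketch (an illustration only, not the CDC 6600 hardware); the field names follow the list above, and can_issue mirrors the structural- and WAW-hazard checks of the Issue stage described next.

from dataclasses import dataclass, field

@dataclass
class FUStatus:
    busy: bool = False   # Busy: is the unit in use?
    op: str = ""         # Op: operation to perform (e.g., '+' or '*')
    Fi: str = ""         # destination register
    Fj: str = ""         # source register 1
    Fk: str = ""         # source register 2
    Qj: str = ""         # FU producing Fj ('' = ready in register file)
    Qk: str = ""         # FU producing Fk
    Rj: bool = False     # Fj ready and not yet read?
    Rk: bool = False     # Fk ready and not yet read?

@dataclass
class Scoreboard:
    instr_status: dict = field(default_factory=dict)   # instr -> step 1..4
    fu: dict = field(default_factory=lambda: {u: FUStatus() for u in
                     ("Integer", "Mult1", "Mult2", "Add", "Divide")})
    result: dict = field(default_factory=dict)          # reg -> producing FU

    def can_issue(self, unit: str, dest: str) -> bool:
        # no structural hazard (unit free) and no WAW (dest not pending)
        return not self.fu[unit].busy and dest not in self.result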
Detailed Scoreboard Pipeline Control
• Issue — wait until: Not Busy(FU) and not Result(D) (no WAW).
  Bookkeeping: Busy(FU)←yes; Op(FU)←op; Fi(FU)←D; Fj(FU)←S1; Fk(FU)←S2; Qj←Result(S1); Qk←Result(S2); Rj←not Qj; Rk←not Qk; Result(D)←FU.
• Read operands — wait until: Rj and Rk (RAW resolved).
  Bookkeeping: Rj←No; Rk←No.
• Execution complete — wait until: functional unit done.
• Write result — wait until: for all f, (Fj(f)≠Fi(FU) or Rj(f)=No) and (Fk(f)≠Fi(FU) or Rk(f)=No) (no WAR).
  Bookkeeping: for all f, if Qj(f)=FU then Rj(f)←Yes; for all f, if Qk(f)=FU then Rk(f)←Yes; Result(Fi(FU))←0; Busy(FU)←No.
Assumed functional-unit latencies:
• Integer: 1 clock cycle
• Add: 2 clock cycles
• Mult: 10 clock cycles
• Div: 40 clock cycles
Dynamic Scheduling Using A Scoreboard
Scoreboard Example
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
FU
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 1
• Issue LD #1. The table shows the cycle in which each operation occurred.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 2
• LD #2 can't issue since the integer unit is busy. MULTD can't issue because we require in-order issue.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 3
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 4
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 5
• Issue LD #2 since the integer unit is now free.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 6
• Issue MULTD.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 7
• SUBD issues. MULTD can't read its operand (F2) because LD #2 hasn't finished.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 8a
• DIVD issues. MULTD and SUBD are both waiting for F2.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 8b
• LD #2 writes F2.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 9
• Now MULTD and SUBD can both read F2. How can both instructions do this at the same time? (Each functional unit latches its own copy of the operands from the register file.)
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 11
• ADDD can't issue because the add unit is busy (with SUBD).
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 12
• SUBD finishes. DIVD is waiting for F0.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 13
• ADDD issues.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 14
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 15
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
1 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 16
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 17
• ADDD can't write because DIVD still has to read F6: a WAR hazard!
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 18
• Nothing happens!
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 19
• MULTD completes execution.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 20
• MULTD writes its result.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 21
• DIVD reads its operands.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 22
• Now ADDD can write since the WAR hazard is removed.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 61
• DIVD completes execution.
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 62
• DONE!
Instruction status (cycle of Issue / Read operands / Execution complete / Write result):
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
Review: Scoreboard
• Limitations of 6600 scoreboard
– No forwarding
– Limited to instructions in basic block (small window)
– Large number of functional units (structural hazards)
– Stall on WAR hazards
– Stall on WAW hazards
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
Name dependences here: WAR (antidependence) on F8 between ADD.D and SUB.D and on F6 between S.D and MUL.D; WAW (output dependence) on F6 between ADD.D and MUL.D.
Another Dynamic Algorithm: Tomasulo
Algorithm
• For IBM 360/91 about 3 years after CDC 6600
• Goal: High Performance without special compilers
• Differences between Tomasulo Algorithm & Scoreboard
– Control & buffers distributed with Functional Units vs.
centralized in scoreboard; called “reservation stations”
– Registers in instructions replaced by pointers to
reservation station buffer
– HW renaming of registers to avoid WAW hazards
– Buffer operand values to avoid WAR hazards
– Common Data Bus broadcasts results to all FUs
– Load and Stores treated as FUs as well
• Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
FP unit and load-store unit using Tomasulo’s alg.
Another Dynamic Algorithm: Tomasulo Algorithm
Register renaming (S and T are temporary names):
DIV.D F0, F2, F4
ADD.D S, F0, F8
S.D S, 0(R1)
SUB.D T, F10, F14
MUL.D F6, F10, T
• Implemented through reservation stations (rs) per
functional unit
– Buffers an operand as soon as it is available – avoids WAR hazards.
– Pending instr. designate rs that will provide their inputs – avoids WAW
hazards.
– Only the last write in a sequence of writes to the same register actually updates the register
– Decentralize hazard detection and execution control
– Instruction results are passed directly to the FU from rs rather than from
registers
• Through common data bus (CDB)
Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
Stall if structural hazard, i.e., no space in the rs. If a reservation station (rs) is free, the issue logic issues the instr to the rs and reads operands into the rs if they are ready (register renaming => solves WAR). Mark the status of the destination register as waiting for this latest instruction, even if the previous instruction writing to this register hasn't completed => solves WAW hazards.
2. Execution—operate on operands (EX)
When both operands are ready then execute;
if not ready, watch CDB for result – Solves RAW
Three Stages of Tomasulo
Algorithm
3. Write result—finish execution (WB)
Write on the Common Data Bus to all awaiting units and mark the reservation station available. Write the result into the destination register only if the register status still names this rs as the producer => solves WAW.
• Normal data bus: data + destination(“go to” bus)
• CDB: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source
address
– Write if matches expected Functional Unit
(produces result)
– Does broadcast
Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Values of the source operands.
Qj, Qk—Name of the RS that will produce the source operand. A value of zero means the source operand is already available in Vj or Vk, or is not necessary.
Busy—Indicates reservation station or FU is busy
Register File Status Qi: Qi —Indicates which functional
unit will write each register, if one exists. Blank (0)
when no pending instructions that will write that
register meaning that the value is already available.
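As a rough illustration (not the IBM 360/91 hardware), the fields above map onto a small Python structure; cdb_broadcast sketches how a (tag, value) pair on the CDB fills waiting reservation stations and updates a register only if Qi still names the broadcasting station, which is what resolves WAW hazards.

from dataclasses import dataclass

@dataclass
class RS:
    busy: bool = False
    op: str = ""
    Vj: float = 0.0      # source operand values (valid when tag is 0)
    Vk: float = 0.0
    Qj: int = 0          # tag of the RS producing Vj; 0 = already available
    Qk: int = 0

stations = {t: RS() for t in range(1, 6)}   # tags 1..5 (tag 0 means "ready")
Qi = {}                                     # register -> producing RS tag
regfile = {}

def cdb_broadcast(tag: int, value: float):
    # Write result: broadcast (tag, value) to every waiting RS ...
    for rs in stations.values():
        if rs.busy and rs.Qj == tag:
            rs.Vj, rs.Qj = value, 0
        if rs.busy and rs.Qk == tag:
            rs.Vk, rs.Qk = value, 0
    # ... and update a register only if it still expects this RS (WAW-safe)
    for reg in [r for r, t in Qi.items() if t == tag]:
        regfile[reg] = value
        del Qi[reg]
    stations[tag].busy = False              # free the reservation station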
Tomasulo Example Cycle 0
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU
Tomasulo Example Cycle 1
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1
Tomasulo Example Cycle 2
• Assume a load takes 2 cycles.
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2- Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1
Tomasulo Example Cycle 3
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 Load1 Yes 34+R2
LD F2 45+ R3 2 3- Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1
Tomasulo Example Cycle 4
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes Sub M(A1) Load2
0 Add2 No
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1
Tomasulo Example Cycle 5
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes Sub M(A1) M(A2)
0 Add2 No
Add3 No
10 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2
Tomasulo Example Cycle 6
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 --
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes Sub M(A1) M(A2)
0 Add2 Yes Add M(A2) Add1
Add3 No
9 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2
Tomasulo Example Cycle 7
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes Sub M(A1) M(A2)
0 Add2 Yes Add M(A2) Add1
Add3 No
8 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2
Tomasulo Example Cycle 8
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
2 Add2 Yes Add M1-M2 M(A2)
Add3 No
7 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 M1-M2 Mult2
Tomasulo Example Cycle 9
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 --
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
1 Add2 Yes Add M1-M2 M(A2)
Add3 No
6 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 M1-M2 Mult2
Tomasulo Example Cycle 10
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 Yes Add M1-M2 M(A2)
Add3 No
5 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 M1-M2 Mult2
Tomasulo Example Cycle 11
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
4 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 12
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
4 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 15
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
0 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 16
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes Div M*F4 M(A1)
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 56
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes Div M*F4 M(A1)
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 57
Instruction status (cycle of Issue / Execution complete / Write result); load buffers (Busy, Address):
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56 57
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
57 FU M*F4 M(A2) M1-M2+M(A2) M1-M2 result
Example: Scoreboard tables before MUL.D writes its result
Instruction Status (Issue / Read Operands / Execution Complete / Write Result):
L.D F6,34(R2) X X X X
L.D F2,45(R3) X X X X
MUL.D F0,F2,F4 X X X
SUB.D F8,F6,F2 X X X X
DIV.D F10,F0,F6 X
ADD.D F6,F8,F2 X X X
Functional unit status:
Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
F0 F2 F4 F6 F8 F10 F12 ……
unit Mult1 Integer Add Divide
Memory Hierarchy
Design and Optimizations
Introduction
• Even a sophisticated processor may perform
well below an ordinary processor:
– Unless supported by matching performance
by the memory system.
• The focus of this module:
– Study how memory system performance has
been enhanced through various innovations
and optimizations.
Typical Memory Hierarchy
[Figure: the memory hierarchy, from Processor/Registers through L1 cache, L2 cache, optional L3 cache, and main memory down to disk/tape; capacity grows and speed falls moving away from the processor.]
• Here we focus on L1/L2/L3 caches, virtual memory, and main memory.
What is the Role of a Cache?
• What is a cache? A small, fast storage used to improve the average access time to a slow memory.
• Why do we need it? To improve memory system performance, by exploiting:
• Locality of reference (very important; established by analysis of programs):
  – Temporal, – Spatial
• Cache block (cache line):
  – A set of contiguous address locations of some size
Four Basic Questions
• Q1: Where can a block be placed in the cache? (Block
placement)
–Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the cache?
(Block identification)
–Tag/Block/word
• Q3: Which block should be replaced on a miss?
(Block replacement)
–Random, LRU, etc.
• Q4: What happens on a write?
(Write strategy)
–Write Back or Write Through (with Write Buffer)
Block Placement
• If a block has only one possible place in
the cache: Direct Mapped
• If a block can be placed anywhere: Fully
Associative
• If a block can be placed in a restricted
subset of the possible places: Set
Associative
– If there are n blocks in each subset: n-way set
associative
• Note that direct-mapped = 1-way set
associative
Trade-offs
• n-way set associative becomes increasingly
difficult (costly) to implement for large n
– Most caches today are either 1-way (direct
mapped), 2-way, or 4-way set associative
• The larger n the lower the likelihood of
thrashing
– e.g., two blocks competing for the same block
frame and being accessed in sequence over and
over
Block Identification cont…
• Given an address, how do we find where it goes in the cache?
• This is done by first breaking the address down into three parts:
  | tag | set index | block offset |
  – tag: used for identifying a match
  – set index: selects the set in the cache
  – block offset: offset of the addressed word within the cache block
  – The tag and set index together form the block address.
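A hedged sketch of this decomposition in Python (word-addressed for simplicity, with illustrative parameters matching the figures that follow: 16-word blocks and 64 sets):

def split_address(addr, words_per_block=16, num_sets=64):
    word_offset = addr % words_per_block
    set_index = (addr // words_per_block) % num_sets
    tag = addr // (words_per_block * num_sets)
    return tag, set_index, word_offset

# e.g., the set-associative layout below: 6-bit tag, 6-bit set, 4-bit word
print(split_address(0b011011_001100_1010))   # -> (27, 12, 10)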
Direct Mapping
• Block j of main memory maps onto block (j modulo 128) of the cache.
• Main memory address = Tag (5 bits) | Block (7 bits) | Word (4 bits):
  – Word (4 bits): selects one of the 16 words in a block (16 = 2^4).
  – Block (7 bits): points to a particular block in the cache (128 = 2^7).
  – Tag (5 bits): compared with the tag bits stored at that cache location, to identify which of the 32 main-memory blocks that map there (4096/128 = 32) is currently resident.
[Figure: direct-mapped cache, with 128 tagged cache blocks and main memory blocks 0…4095.]
Associative Mapping
• A main memory block can be placed in any cache block.
• Main memory address = Tag (12 bits) | Word (4 bits):
  – Word (4 bits): selects one of the 16 words in a block (16 = 2^4).
  – Tag (12 bits): identifies which of the 4096 main-memory blocks (4096 = 2^12) is resident in the cache.
[Figure: associative-mapped cache, with 128 tagged cache blocks and main memory blocks 0…4095.]
Set-Associative Mapping
• Main memory address = Tag (6 bits) | Set (6 bits) | Word (4 bits):
  – Word (4 bits): selects one of the 16 words in a block (16 = 2^4).
  – Set (6 bits): points to a particular set in the cache (128/2 = 64 = 2^6).
  – Tag (6 bits): compared with the tags of the blocks in the set to check if the desired block is present (4096/64 = 2^6).
[Figure: set-associative-mapped cache with two blocks per set; sets 0…63, main memory blocks 0…4095.]
Cache Write Policies
• Write-through: Information is written to both
the block in the cache and the block in
memory
• Write-back: Information is written back to
memory only when a block frame is replaced:
– Uses a “dirty” bit to indicate whether a block was
actually written to,
– Saves unnecessary writes to memory when a
block is “clean”
Trade-offs
• Write back
– Faster because writes occur at the speed
of the cache, not the memory.
– Faster because multiple writes to the same block are written back to memory only once, which uses less memory bandwidth.
• Write through
– Easier to implement
Write Allocate, No-write Allocate
– On a read miss, a block has to be brought in
from a lower level memory
• What happens on a write miss?
• Two options:
– Write allocate: a block is allocated in the cache.
– No-write allocate: no block is allocated; the data is written only to main memory.
Write Allocate, No-write Allocate
cont…
• In no-write allocate,
– Only blocks that are read from can be in
cache.
– Write-only blocks are never in cache.
• But typically:
– write-allocate used with write-back
– no-write allocate used with write-through
Memory System Performance
Memory system performance is largely captured
by three parameters,
– Latency, Bandwidth, Average memory access time
(AMAT).
• Latency: The time it takes from the issue of a
memory request to the time the data is
available at the processor.
• Bandwidth: The rate at which data can be
pumped to the processor by the memory
system.
Memory System Performance
Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
Write Mem[100];
Write Mem[100];
Read Mem[200];
Write Mem[200];
Write Mem[100].
What are the number of hits and misses when using
no-write allocate versus write allocate?
Memory System Performance
Answer
For no-write allocate the address 100 is not in the cache, and
there is no allocation on write, so the first two writes will
result in misses.
Address 200 is also not in the cache, so the read is also a miss.
The subsequent write to address 200 is a hit. The last write to
100 is still a miss. The result for no-write allocate is four
misses and one hit.
For write allocate, the first accesses to 100 and 200 are misses,
and the rest are hits since 100 and 200 are both found in the
cache.
• Thus, the result for write allocate is two misses and three hits.
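The counts above can be checked with a few lines of Python (a toy fully associative cache that starts empty; eviction never occurs in this short sequence):

def run(seq, write_allocate):
    cache, hits, misses = set(), 0, 0
    for op, addr in seq:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            if op == "R" or write_allocate:
                cache.add(addr)   # reads always allocate; writes only if WA
    return hits, misses

seq = [("W", 100), ("W", 100), ("R", 200), ("W", 200), ("W", 100)]
print(run(seq, write_allocate=False))   # -> (1, 4): one hit, four misses
print(run(seq, write_allocate=True))    # -> (3, 2): three hits, two misses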
Average Memory Access Time (AMAT)
• AMAT: The average time it takes for the processor to get a
data item it requests.
• The time it takes to get requested data to the processor can
vary: due to the memory hierarchy.
• Performance of a cache is largely determined by:
– Cache miss rate: number of cache misses divided by
number of accesses.
– Cache hit time: the time between sending address and data
returning from cache.
– Cache miss penalty: the extra processor stall cycles caused
by access to the next-level cache.
• AMAT can be expressed as:
  AMAT = Cache hit time + Miss rate × Miss penalty
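In code, the formula is a one-liner (times in cycles; the numbers below are illustrative, not taken from a specific example):

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.02, 400))   # 1-cycle hit, 2% misses, 400-cycle penalty -> 9.0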
Impact of Memory System on
Processor Performance
CPU Performance with Memory Stall= CPI without stall +
Memory Stall CPI
Memory Stall CPI
= Miss per inst × miss penalty
= % Memory Access/Instr × Miss rate × Miss
Penalty
Example: Assume 0.2 memory accesses per instruction, a 2% miss rate, and a 400-cycle miss penalty. How much is the memory stall CPI?
Memory stall CPI = 0.2 × 0.02 × 400 = 1.6 cycles
CPU Performance with Memory Stall
• CPU performance with memory stall = CPI without stall + memory stall CPI
• CPU time = IC × (CPI_execution + CPI_mem_stall) × Cycle time
• CPI_mem_stall = Misses per instruction × Miss penalty
               = Memory accesses per instruction × Miss rate × Miss penalty
Performance Example 1
• Suppose:
–Clock Rate = 200 MHz (5 ns per cycle), Ideal (no
misses) CPI = 1.1
–50% arith/logic, 30% load/store, 20% control
–10% of data memory operations get 50 cycles miss
penalty
–1% of instruction memory operations also get 50
cycles miss penalty
• Compute AMAT
Performance Example 1 cont…
• CPI = ideal CPI + average stalls per instruction
  = 1.1 (cycles/ins)
  + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
  + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
  = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• AMAT = (1/1.3) × [1.1 + 0.01 × 50] + (0.3/1.3) × [1.1 + 0.1 × 50] ≈ 2.63
Example 2
•Assume 20% Load/Store instructions
•Assume CPI without memory stalls is 1
•Cache hit time = 1 cycle
•Cache miss penalty = 100 cycles
•Miss rate = 1%
What is: stall cycles per instruction?
average memory access time?
CPI with and without cache?
Example 2: Answer
• Average memory accesses per
instruction = 1.2
• AMAT = 1 + 1.2*0.01*100= 2.2 cycles
• Stall cycle = 1.2 cycles
• CPI with cache = 1+1.2=2.2
• CPI without cache=1+1.2*100=121
Unified vs Split Caches
• Separate instruction and data caches:
  – Avoid structural hazards.
  – Each cache can be tailored to its specific need.
[Figure: left, a unified organization (Processor → Unified Cache-1 → Unified Cache-2); right, a split organization (Processor → I-Cache-1 and D-Cache-1 → Unified Cache-2).]
Example 3
• Which has a lower miss rate?
– A split cache (16KB instruction cache +16KB Data
cache) or a 32 KB unified cache?
• Compute the respective AMAT also.
• 40% Load/Store instructions
• Hit time = 1 cycle
• Miss penalty = 100 cycles
• Simulator showed:
– 40 misses per thousand instructions for data cache
– 4 misses per thousand instr for instruction cache
– 44 misses per thousand instr for unified cache
Example 3: Answer
• Miss rate = (misses/instruction) / (memory accesses/instruction)
• Instruction cache miss rate = 4/1000 = 0.004
• Data cache miss rate = (40/1000)/0.4 = 0.1
• Unified cache miss rate = (44/1000)/1.4 ≈ 0.031
• Overall miss rate for split cache ≈ 0.3 × 0.1 + 0.7 × 0.004 ≈ 0.033
  (weights: 1/1.4 ≈ 0.7 of accesses are instruction fetches, 0.4/1.4 ≈ 0.3 are data accesses)
Example 3: Answer cont…
• AMAT (split) = 0.7 × (1 + 0.004 × 100) + 0.3 × (1 + 0.1 × 100) ≈ 4.3
• AMAT (unified) = 0.7 × (1 + 0.031 × 100) + 0.3 × (1 + 1 + 0.031 × 100) ≈ 4.5
  (the extra cycle on data accesses reflects the structural hazard of the single unified-cache port)
Example 4
– Assume 16KB Instruction and Data Cache:
– Inst miss rate=0.64%, Data miss rate=6.47%
– 32KB unified: Aggregate miss rate=1.99%
– Assume 33% data ops => 75% of accesses are from instructions (1.0/1.33)
– hit time=1, miss penalty=50
– Data hit has 1 additional stall for unified cache
– Which is better (ignore L2 cache)?
AMATSplit =75%x(1+0.64%x50)+25%x(1+6.47%x50)
= 2.05
AMATUnified=75%x(1+1.99%x50)+25%x(1+1+1.99%x50)
= 2.24
Example 5
• What is the impact of 2 different cache
organizations on the performance of CPU?
• Clock cycle time 1nsec
• 50% load/store instructions
• Size of both caches: 64KB
  – Both caches have a block size of 64 bytes
  – One is direct mapped, the other is 2-way set associative
• Cache miss penalty=75 ns for both caches
• Miss rate DM= 1.4% Miss rate SA=1%
• CPU cycle time must be stretched 25% to
accommodate the multiplexor for the SA
Example 5: Solution
• AMAT_DM = 1 + (0.014 × 75) = 2.05 ns
• AMAT_SA = 1 × 1.25 + (0.01 × 75) = 2.0 ns
• CPU time = IC × (CPI + (Misses/Instr) × Miss penalty) × Clock cycle time, with a base CPI of 2 and 1.5 memory accesses per instruction (as the example assumes)
• CPU time_DM = IC × (2 × 1.0 + 1.5 × 0.014 × 75) ≈ 3.58 × IC
• CPU time_SA = IC × (2 × 1.25 + 1.5 × 0.01 × 75) ≈ 3.63 × IC
Cache Optimizations
How to Improve Cache Performance?
AMAT = Hit time + Miss rate × Miss penalty
1. Reduce miss rate
   – Larger block size
   – Larger cache size
   – Higher associativity
2. Reduce miss penalty
   – Multilevel caches
   – Read priority over write on miss
3. Reduce hit time
   – Avoiding address translation
Cache Optimizations
Causes Of Misses
• To be able to reduce the miss rate, we use a model that sorts all misses into the 3 Cs:
• Compulsory—To bring blocks into the cache for the first time.
  • Also called cold-start misses or first-reference misses.
  • These misses occur even in an infinite cache.
• Capacity—The cache is not large enough, so some blocks are discarded and later retrieved.
  • These misses occur even in a fully associative cache.
Cache Optimizations
Causes Of Misses
• Conflict—Blocks can be discarded and later retrieved if too many blocks map to a set.
  • Also called collision misses or interference misses.
  • (Misses in an N-way associative, size-X cache.)
• Later we shall discuss a 4th "C":
  • Coherence—Misses caused by cache coherence. To be discussed in the multiprocessors part.
Larger block size(1)
to Reduce miss rate(1)
Larger block sizes:
• reduce compulsory misses, by taking advantage of spatial locality;
• increase the miss penalty (more data is fetched per miss);
• can increase conflict misses, and even capacity misses if the cache is small, since fewer blocks fit in the cache.
Larger block size(1) to Reduce
miss rate(1) Contd…
• Clearly, there is little reason to increase the block size
to such a size that it increases the miss rate.
• There is also no benefit to reducing miss rate if it
increases the average memory access time.
• The increase in miss penalty may outweigh the
decrease in miss rate.
• High latency and high bandwidth encourage large
block size since the cache gets many more bytes per
miss for a small increase in miss penalty.
Larger block size(1) to Reduce miss rate(1) Contd…
[Figure: miss rate as a function of block size for several cache sizes.]
Larger Caches(2) to
Reduce Miss Rate(1)
• The obvious way to reduce capacity misses is to
increase capacity of the cache.
• The obvious drawback is potentially longer hit
time and higher cost and power. This technique
has been especially popular in off-chip caches.
Higher Associativity(3) to Reduce
Miss Rate(1)
• miss rates improve with higher associativity.
• There are two general rules of thumb that can
be gleaned from many observations.
• The first is that an eight-way set-associative cache is, for practical purposes, as effective in reducing misses as a fully associative one.
• The second is that a direct mapped cache of
size N has about the same miss rate as a two-
way set-associative cache of size N/2(called 2:1
cache rule ).
Higher Associativity(3) to Reduce
Miss Rate(1)
• So, improving one aspect of the average
memory access time comes at the expense of
another.
• greater associativity can come at the cost of
increased hit time.
Higher Associativity(3) to Reduce
Miss Rate(1)
• Assume higher associativity would increase the clock
cycle time as listed below:
• Clock cycle time2-way = 1.36 × Clock cycle time1-way
• Clock cycle time4-way = 1.44 × Clock cycle time1-way
• Clock cycle time8-way = 1.52 × Clock cycle time1-way
• Assume the hit time is 1 clock cycle and that the miss penalty for the direct-mapped cache is 25 clock cycles to a level-2 cache (see the next subsection) that never misses.
Higher Associativity(3) to Reduce
Miss Rate(1)
• for which cache sizes are each of these three
statements true?
• Average memory access time8-way < Average
memory access time4-way
• Average memory access time4-way < Average
memory access time2-way
• Average memory access time2-way < Average
memory access time1-way
Higher Associativity(3) to Reduce
Miss Rate(1)
• Answer: the average memory access time for each associativity is
  AMAT_8-way = Hit time_8-way + Miss rate_8-way × Miss penalty = 1.52 + Miss rate_8-way × 25
  AMAT_4-way = 1.44 + Miss rate_4-way × 25
  AMAT_2-way = 1.36 + Miss rate_2-way × 25
Multi-Level Cache(1) to Reduce
Miss Penalty(2)
• Add a second-level cache.
• L2 Equations:
  AMAT = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
  Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
  AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
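A direct transcription of the two-level equation; the numbers reproduce Example 6 below (1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, 200-cycle L2 miss penalty):

def amat_l2(hit_l1, mr_l1, hit_l2, local_mr_l2, penalty_l2):
    return hit_l1 + mr_l1 * (hit_l2 + local_mr_l2 * penalty_l2)

print(amat_l2(1, 0.04, 10, 0.5, 200))   # -> 5.4 clock cycles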
Multi-Level Cache: Some Definitions
• Local miss rate— misses in this cache divided
by the total number of memory accesses to
this cache (Miss rateL2)
• Global miss rate—misses in this cache divided
by the total number of memory accesses
generated by the CPU
• L1 global miss rate = L1 local miss rate
• For the 2nd-level cache:
  Global miss rate_L2 = Local miss rate_L1 × Local miss rate_L2
Global vs. Local Miss Rates
• At lower level caches (L2 or L3), global
miss rates provide more useful
information:
– Indicate how effective is cache in reducing
AMAT.
– Who cares if the miss rate of L3 is 50% as
long as only 1% of processor memory
accesses ever benefit from it?
Performance Improvement Due to
L2 Cache: Example 6
Assume:
• For 1000 memory references :
– 40 misses in L1,
– 20 misses in L2
• L1 hit time: 1 cycle,
• L2 hit time: 10 cycles,
• L2 miss penalty=200
• 1.5 memory references per instruction
• Assume ideal CPI=1.0
Find: AMAT, Average stall cycles per instruction
Example 6: Solution
• Average memory access time = Hit timeL1 + Miss
rateL1 × (Hit timeL2 + Miss rateL2 × Miss penaltyL2)
= 1 + 4% × (10 + 50% × 200) = 1 + 4% × 110 = 5.4
clock cycles
• To see how many misses we get per instruction: 1000 memory references correspond to 1000/1.5 ≈ 667 instructions.
• Per 1000 instructions we therefore have 40 × 1.5 = 60 L1 misses and 20 × 1.5 = 30 L2 misses.
Example 6: Solution
• Average memory stalls per instruction = Misses per
instructionL1 × Hit timeL2 + Misses per instructionL2
× Miss penaltyL2
= (60/1000) × 10 + (30/1000) × 200
= 0.060 × 10 + 0.030 × 200 = 6.6 clock cycles
• If we subtract the L1 hit time from AMAT and then
multiply by the average number of memory
references per instruction, we get the same average
memory stalls per instruction: (5.4 – 1.0) × 1.5 = 4.4 ×
1.5 = 6.6 clock cycles
Multilevel Cache
• The speed (hit time) of L1 cache affects the
clock rate of CPU:
– Speed of L2 cache only affects miss penalty of L1.
• Inclusion Policy:
– Many designers keep L1 and L2 block sizes the
same.
– Otherwise on a L2 miss, several L1 blocks may
have to be invalidated.
Read Priority over Write on Miss(2) to
Reduce Miss Penalty (2)
• In a write-back scheme:
– Normally a dirty block is stored in a write
buffer temporarily.
– Usual:
• Write all blocks from the write buffer to
memory, and then do the read.
– Instead:
• Check write buffer first, if not found, then
initiate read.
• CPU stall cycles would be less.
SW R3, 512(R0)  ; M[512] ← R3  (cache index 0)
LW R1, 1024(R0) ; R1 ← M[1024] (cache index 0)
LW R2, 512(R0)  ; R2 ← M[512]  (cache index 0)
• Assume a direct-mapped, write-through cache that maps 512 and 1024 to the same block, and a four-word write buffer that is not checked on a read miss.
• Will the value in R2 always be equal to the value in R3? (No: the read miss on 512 may be serviced from memory before the buffered SW to 512 has drained.)
Read Priority over Write on Miss(2) to
Reduce Miss Penalty (2)
• A write buffer with a write through:
– Allows cache writes to occur at the speed of
the cache.
• Write buffer however complicates memory
access:
– They may hold the updated value of a
location needed on a read miss.
Read Priority over Write on Miss(2) to
Reduce Miss Penalty (2)
• Write-through with write buffers:
–Read priority over write: Check write buffer
contents before read;
if no conflicts, let the memory access continue.
–Write priority over read: Waiting for write
buffer to first empty, can increase read miss
penalty.
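A minimal sketch of the read-priority policy (an assumed four-entry FIFO write buffer; the names are illustrative):

memory = {}
write_buffer = []                       # FIFO of (addr, value) pairs

def write(addr, value):
    if len(write_buffer) >= 4:          # buffer full: drain the oldest entry
        a, v = write_buffer.pop(0)
        memory[a] = v
    write_buffer.append((addr, value))

def read(addr):
    for a, v in reversed(write_buffer): # check the buffer first; youngest wins
        if a == addr:
            return v                    # forward buffered value: no stale read
    return memory.get(addr, 0)          # no conflict: the read proceeds at once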
HPCA
Multiprocessor
A Broad Classification of Computers
• Shared-memory multiprocessors
– Also called UMA
• Distributed memory computers
– Also called NUMA:
• Distributed Shared-memory (DSM)
architectures
• Clusters
• Grids, etc.
UMA vs. NUMA Computers
[Figure: (a) UMA model, processors P1…Pn each with a cache sharing main memory over a bus; (b) NUMA model, each processor with a cache and its own local main memory, the nodes connected by a network. Network latencies range from 100s of ns in tightly coupled machines up to milliseconds or seconds in loosely coupled systems.]
Distributed Memory Computers
• Distributed memory computers use:
–Message Passing Model
• Explicit message send and receive instructions
have to be written by the programmer.
–Send: specifies local buffer + receiving process (id)
on remote computer (address).
–Receive: specifies sending process on remote
computer + local buffer to place data.
Advantages of Message-Passing
Communication
• Hardware for communication and
synchronization are much simpler:
–Compared to communication in a shared memory
model.
• Explicit communication:
–Programs simpler to understand, helps to reduce
maintenance and development costs.
• Synchronization is implicit:
–Naturally associated with sending/receiving
messages.
–Easier to debug.
Disadvantages of Message-Passing
Communication
• Programmer has to write explicit message
passing constructs.
– Also, precisely identify the processes (or
threads) with which communication is to
occur.
• Explicit calls to operating system:
– Higher overhead.
DSM
• Physically separate memories are accessed
as one logical address space.
• Processors running on a multi-computer
system share their memory.
– Implemented by operating system.
• DSM multiprocessors are NUMA:
– Access time depends on the exact location of
the data.
Distributed Shared-Memory
Architecture (DSM)
• Underlying mechanism is message passing:
– Shared memory convenience provided to the
programmer by the operating system.
– Basically, an operating system facility takes care
of message passing implicitly.
• Advantage of DSM:
– Ease of programming
Disadvantage of DSM
• High communication cost:
– A program not specifically optimized for
DSM by the programmer shall perform
extremely poorly.
– Data (variables) accessed by specific
program segments have to be collocated.
– Useful only for process-level (coarse-
grained) parallelism.
Symmetric Multiprocessors (SMPs)
• SMPs are a popular shared memory
multiprocessor architecture:
–Processors share Memory and I/O
–Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: SMP, four processors each with a cache, connected by a bus to main memory and the I/O system.]
SMPs: Some Insights
• In any multiprocessor, main memory access is
a bottleneck:
–Multilevel caches reduce the memory demand of a
processor.
–Multilevel caches in fact make it possible for more
than one processor to meaningfully share the
memory bus.
–Hence multilevel caches are a must in a
multiprocessor!
Pros of SMPs
• Ease of programming:
–Especially when communication
patterns are complex or vary
dynamically during execution.
Cons of SMPs
• As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model restricted.
– One way out may be to use switches (crossbar,
multistage networks, etc.) instead of a bus.
– Switches set up parallel point-to-point
connections.
– Again switches are not without any
disadvantages: make implementation of cache
coherence difficult.
An Important Problem with
Shared-Memory: Coherence
• When shared data are cached:
– These are replicated in multiple caches.
– The data in the caches of different
processors may become inconsistent.
• How to enforce cache coherency?
– How does a processor know changes in the
caches of other processors?
The Cache Coherency Problem
[Figure: processors P1–P3 with private caches share variable U, initially 5 in memory. (1) P1 reads U = 5; (2) P3 reads U = 5; (3) P3 writes U = 7, so only its cached copy changes; (4)–(5) P1 and P2 then read U.]
• What value will P1 and P2 read?
Cache Coherence Solutions
(Protocols)
• The key to maintain cache coherence:
– Track the state of sharing of every data
block.
• Based on this idea, following can be an
overall solution:
– Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
Pros and Cons of the Solution
• Pro:
–Consistency maintenance becomes
transparent to programmers,
compilers, as well as to the operating
system.
• Con:
–Increased hardware complexity.
Two Important Cache Coherency
Protocols
• Snooping protocol:
– Each cache “snoops” the bus to find out which
data is being used by whom.
• Directory-based protocol:
– Keep track of the sharing state of each data
block using a directory.
– A directory is a centralized registry for all memory blocks.
– It allows the coherency protocol to avoid broadcasts.
Snooping vs. Directory-based
Protocols
• Snooping protocol reduces memory traffic.
– More efficient.
• Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability is a
problem.
– Some work arounds have been tried: Sun
Enterprise server has up to 4 buses.
Snooping Protocol
• As soon as a request for any data block by a
processor is put out on the bus:
–Other processors “snoop” to check if they have a
copy and respond accordingly.
• Works well with bus interconnection:
–All transmissions on a bus are essentially broadcast:
• Snooping is therefore effortless.
–Dominates almost all small scale machines.
Categories of Snoopy
Protocols
• Essentially two types:
–Write Invalidate Protocol
–Write Broadcast Protocol
• Write invalidate protocol:
–When one processor writes to its cache, all other
processors having a copy of that data block
invalidate that block.
• Write broadcast:
–When one processor writes to its cache, all other
processors having a copy of that data block
update that block with the recent written value.
Write Invalidate vs. Write Update Protocols
[Figure: bus-based SMP, four processors with caches sharing main memory and I/O over a bus; the two protocols differ in whether a writer invalidates or updates the other cached copies.]
Write Invalidate Protocol
• Handling a write to shared data:
– An invalidate command is sent on bus --- all
caches snoop and invalidate any copies they
have.
• Handling a read Miss:
– Write-through: memory is always up-to-date.
– Write-back: snooping finds most recent copy.
Write Invalidate in Write Through
Caches
• Simple implementation.
• Writes:
– Write to shared data: broadcast on bus, processors
snoop, and update any copies.
– Read miss: memory is always up-to-date.
• Concurrent writes:
– Write serialization automatically achieved since bus
serializes requests.
– Bus provides the basic arbitration support.
Write Invalidate versus
Broadcast cont…
• Invalidate exploits spatial locality:
–Only one bus transaction for any number of
writes to the same block.
–Obviously, more efficient.
• Broadcast has lower latency for writes and reads:
–As compared to invalidate.
An Example Snoopy Protocol
• Assume:
–Invalidation protocol, write-back cache.
• Each block of memory is in one of the
following states:
–Shared: Clean in all caches and up-to-date in
memory, block can be read.
–Exclusive: cache has the only copy, it is writeable,
and dirty.
–Invalid: Data present in the block obsolete,
cannot be used.
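A toy version of these three states and the invalidate action, assuming a write-back, write-invalidate protocol as above (simplified: write-backs on downgrade are omitted):

from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1      # clean, readable, possibly in several caches
    EXCLUSIVE = 2   # the only copy, writeable, dirty

caches = [dict() for _ in range(4)]      # per processor: block -> State

def bus_write_miss(block, writer):
    # the writer gains exclusive ownership; all snoopers invalidate
    for pid, c in enumerate(caches):
        if pid != writer and c.get(block, State.INVALID) != State.INVALID:
            c[block] = State.INVALID
    caches[writer][block] = State.EXCLUSIVE

def cpu_read(block, pid):
    if caches[pid].get(block, State.INVALID) == State.INVALID:
        caches[pid][block] = State.SHARED   # read miss: fetch a clean copy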
Cache Coherence Protocols
Implementation of the Snooping
Protocol
• A cache controller at every processor would
implement the protocol:
– Has to perform specific actions:
• When the local processor requests certain things.
• Also, certain actions are required when certain
address appears on the bus.
– Exact actions of the cache controller depends on
the state of the cache block.
– Two FSMs can show the different types of actions
to be performed by a controller.
Snoopy-Cache State Machine-I
• State machine considering only CPU requests, for each cache block. States: Invalid, Shared (read only), Exclusive (read/write).
• Invalid + CPU read -> Shared: place read miss on bus.
• Invalid + CPU write -> Exclusive: place write miss on bus.
• Shared + CPU read hit: no bus action.
• Shared + CPU read miss (replacing the block): place read miss on bus.
• Shared + CPU write -> Exclusive: place write miss on bus.
• Exclusive + CPU read hit or CPU write hit: no bus action.
• Exclusive + CPU read miss -> Shared: write back the block, place read miss on bus.
• Exclusive + CPU write miss: write back the cache block, place write miss on bus.
Snoopy-Cache State Machine-II
• State machine considering only bus requests, for each cache block.
• Shared + write miss for this block -> Invalid.
• Exclusive + write miss for this block -> Invalid: write back the block (abort memory access).
• Exclusive + read miss for this block -> Shared: write back the block (abort memory access).
Combined Snoopy-Cache State Machine
• State machine considering both CPU requests and bus requests for each cache block: the transitions are the union of Machines I and II above, with the same write-backs and bus placements.
Directory-based Solution
• In NUMA computers:
– Messages have long latency.
– Also, broadcast is inefficient --- all messages have
explicit responses.
• Main memory controller to keep track of:
– Which processors are having cached copies of
which memory locations.
• On a write,
– Only need to inform users, not everyone
• On a dirty read,
– Forward to owner
Directory Protocol
• Three states as in Snoopy Protocol
–Shared: 1 or more processors have data, memory is
up-to-date.
–Uncached: No processor has the block.
–Exclusive: 1 processor (owner) has the block.
• In addition to cache state,
–Must track which processors have data when in the
shared state.
–Usually implemented using bit vector, 1 if processor
has copy.
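One way to picture a directory entry per memory block (an illustrative sketch, not a specific machine): a state plus a sharer bit vector, where a set bit p means processor p has a copy.

from dataclasses import dataclass

@dataclass
class DirEntry:
    state: str = "uncached"    # 'uncached' | 'shared' | 'exclusive'
    sharers: int = 0           # bit vector of caching processors

    def add_sharer(self, p):
        self.sharers |= 1 << p

    def invalidate_targets(self):
        # on a write, the set bits name exactly the caches to invalidate
        targets = [p for p in range(self.sharers.bit_length())
                   if (self.sharers >> p) & 1]
        self.sharers = 0
        return targets

d = DirEntry("shared")
d.add_sharer(0); d.add_sharer(3)
print(d.invalidate_targets())   # -> [0, 3]: no broadcast needed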
Directory Behavior
• On a read:
  – Uncached: give an (exclusive) copy to the requester; record the owner.
  – Exclusive or shared: send a share message to the current exclusive owner (if any); record the new sharer; return the value.
  – Exclusive dirty: forward the read to the owner (as noted above).
Directory Behavior
• On Write
– Send invalidate messages to all hosts
caching values.
• On Write-Thru/Write-back
– Update value.
CPU-Cache State Machine
• State machine for CPU requests, for each memory block; Invalid state if the block is only in memory.
• Invalid + CPU read -> Shared (read only): send read-miss message to the home directory.
• Invalid + CPU write -> Exclusive (read/write): send write-miss message to the home directory.
• Shared + CPU read hit: no action.
• Shared + CPU write -> Exclusive: send write-miss message to the home directory.
• Shared + invalidate (or miss due to address conflict) -> Invalid.
• Exclusive + CPU read hit or write hit: no action.
• Exclusive + fetch/invalidate (or miss due to address conflict) -> Invalid: send data write-back message to the home directory.
• Exclusive + fetch -> Shared: send data write-back message to the home directory.
Directory State Machine
• State machine for directory requests, for each memory block; Uncached state if the block is only in memory.
• Uncached + read miss -> Shared: Sharers = {P}; send data value reply.
• Uncached + write miss -> Exclusive: Sharers = {P}; send data value reply.
• Shared + read miss -> Shared: Sharers += {P}; send data value reply.
• Shared + write miss -> Exclusive: send invalidate to Sharers; Sharers = {P}; send data value reply.
• Exclusive + read miss -> Shared: send fetch to the owner; Sharers += {P}; send data value reply to the remote cache.
• Exclusive + write miss -> Exclusive (new owner): send fetch/invalidate to the old owner; Sharers = {P}; send data value reply to the remote cache.
• Exclusive + data write-back -> Uncached: Sharers = {} (write back the block).
Superscalar Processors
&
VLIW Processors
Topics to be covered
• Introduction to Super scalar Processor.
• Architecture of Superscalar Processor.
• VLIW Processor.
• Architecture of VLIW Processor.
• Difference between Superscalar and VLIW
processor.
• A superscalar machine executes multiple independent instructions in parallel.
• Superscalar machines are pipelined as well.
• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
• The order of execution is usually assisted by the compiler.
Super-pipelined Processor
• A traditional pipelined system has a single pipeline stage for each sub-operation, and every instruction has to pass through each dedicated segment.
• A super-pipelined processor, by contrast, subdivides each of these logical steps into multiple shorter pipeline stages, allowing a higher clock rate. For example, splitting each of five stages in two yields a ten-stage pipeline that can be clocked roughly twice as fast.
Superscalar v Super-pipelined
• A more aggressive approach to achieving parallelism is to equip the processor with multiple processing units that handle several instructions in parallel in each processing stage.
• Such processors can achieve an instruction execution throughput of more than one instruction per cycle. These processors are known as superscalar processors.
• In a superscalar processor the instruction queue has to remain filled.
• Multiple-issue operation requires a wider path to the cache and multiple execution units.
• Separate execution units are provided for integer and floating-point instructions.
[Figure 8.19: A processor with two execution units. An instruction fetch unit (F) feeds an instruction queue; a dispatch unit issues instructions to an integer unit and a floating-point unit; completed results go to the write-results stage (W).]
Working Principle
• The fetch (F) unit is capable of reading two instructions at a time and storing them in the instruction queue.
• In each clock cycle the dispatch unit retrieves and decodes up to two instructions from the front of the queue.
• If there is one integer and one floating-point instruction and no hazards, both instructions are dispatched in the same clock cycle, as the sketch below shows.
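A hedged C sketch of this dispatch rule follows; the queue layout, the hazard-check stub, and the issue functions are simplified assumptions, not the actual hardware of Figure 8.19.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { INT_OP, FP_OP } InstrClass;
typedef struct { InstrClass cls; /* opcode, registers, ... */ } Instr;

/* A toy instruction queue filled by the fetch unit. */
static Instr queue[8] = { { INT_OP }, { FP_OP } };
static int head = 0, count = 2;

/* Hazard-check stub: a real dispatch unit checks data and resource
 * hazards between the pair; here we assume none. */
static bool has_hazard(const Instr *a, const Instr *b) {
    (void)a; (void)b; return false;
}

static void issue(const Instr *i) {
    printf("issue to %s unit\n", i->cls == INT_OP ? "integer" : "floating-point");
}

/* One clock cycle of the dispatch unit: decode up to two instructions
 * from the front of the queue; dispatch both in the same cycle only if
 * they are one integer plus one floating-point op with no hazards. */
static void dispatch_cycle(void) {
    if (count == 0) return;
    const Instr *i0 = &queue[head];
    bool dual = count >= 2 &&
                queue[head + 1].cls != i0->cls &&
                !has_hazard(i0, &queue[head + 1]);
    issue(i0);
    if (dual) issue(&queue[head + 1]);
    head  += dual ? 2 : 1;
    count -= dual ? 2 : 1;
}

int main(void) { dispatch_cycle(); return 0; }
```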
• Out-of-order execution may also lead to exceptions that leave the program in an inconsistent state.
• Exceptions are of two types: imprecise and precise.
• Imprecise: let I1 and I2 be two instructions issued in the same clock cycle. I1 causes an exception that leaves the program inconsistent, while I2 has already completed its write-back. If such a situation is permitted, the exception is known as an imprecise exception.
• To keep the program consistent, writes to the destination registers must follow the program instruction order, i.e. occur in order.
• Precise exception: if an exception occurs during an instruction's execution, all subsequent instructions that may have been partially executed are discarded. This is called a precise exception.
Execution Completion
• Out-of-order execution is desirable because it frees execution units for other instructions.
• Instructions must be completed in program order to allow precise exceptions.
• These two requirements conflict with each other.
• The conflict can be resolved if execution is allowed to proceed out of order, but results are first written into temporary registers and later transferred into the destination registers in the correct program order.
• This transfer is called the commitment step.
• When out-of-order execution is allowed, a special control unit is needed to guarantee in-order commitment. This is called the commitment unit; a sketch follows below.
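Here is a minimal C sketch of such a commitment unit, assuming a small circular buffer of in-flight instructions (essentially a reorder buffer); the sizes and field names are illustrative, not any specific processor's design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS 8

/* One in-flight instruction: its result waits in a temporary register
 * until the commitment step copies it to the architectural destination. */
typedef struct {
    bool     busy;        /* slot holds an in-flight instruction      */
    bool     done;        /* execution finished, result in temp_value */
    bool     exception;   /* the instruction raised an exception      */
    int      dest_reg;    /* architectural destination register       */
    uint64_t temp_value;  /* the temporary register for this result   */
} Slot;

static uint64_t arch_regs[32];
static Slot slots[SLOTS];
static int head = 0;      /* oldest instruction in program order      */

/* Commitment unit: retire completed instructions strictly in program
 * order, so any exception remains precise (younger results, even if
 * already computed, never reach the architectural registers). */
static void commit_cycle(void) {
    Slot *e = &slots[head];
    if (!e->busy || !e->done) return;       /* oldest not done: wait  */
    if (e->exception) {
        printf("precise exception: discard younger instructions\n");
        for (int i = 0; i < SLOTS; i++) slots[i].busy = false;
        return;
    }
    arch_regs[e->dest_reg] = e->temp_value; /* the commitment step    */
    e->busy = false;
    head = (head + 1) % SLOTS;
}

int main(void) {
    slots[0] = (Slot){ .busy = true, .done = true, .dest_reg = 1, .temp_value = 42 };
    commit_cycle();
    printf("R1 = %llu\n", (unsigned long long)arch_regs[1]);
    return 0;
}
```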
Dispatch Operation
Should instructions be dispatched out of order?
• We must ensure that there is no possibility of deadlock. If instructions are dispatched out of order, a deadlock can arise as follows.
• Suppose that the processor has only one temporary register, and that when I5 is dispatched, that register is reserved for it. Instruction I4 cannot be dispatched because it is waiting for the temporary register, which in turn will not become free until I5 is retired. Since I5 cannot be retired before I4, we have a deadlock.
Issues related to Superscalar Processor
• Dependent upon:
  – Instruction-level parallelism possible
  – Compiler-based optimization
  – Hardware support
• Limited by:
  – Data dependency
  – Procedural dependency
  – Resource conflicts
VLIW Processor
Basic Working Principles of VLIW
• Aim at speeding up computation by exploiting instruction-level parallelism.
• Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel.
• An instruction consists of multiple operations; typical word lengths range from 52 bits to 1 Kbit.
• All operations in an instruction are executed in lock-step mode.
• Rely on the compiler to find parallelism and schedule dependency-free program code.
Basic VLIW Approach
Register File Structure for VLIW
Differences Between VLIW & Superscalar
Architecture (I)
Differences Between VLIW & Superscalar
Architecture (II)
• Instruction formulation:
  – Superscalar:
    • Receives conventional instructions conceived for sequential processors.
  – VLIW:
    • Receives (very) long instruction words, each comprising a field (or opcode) for each execution unit, as the sketch below illustrates.
    • Instruction word length depends on (a) the number of execution units, and (b) the code length needed to control each unit (opcode length, register names, …).
    • Typical word length is 64 – 1024 bits, much longer than a conventional machine word.
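To make the word format concrete, here is a hedged C sketch of one possible VLIW instruction word for an imagined machine with four execution units; the slot layout and field widths are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* One operation slot: what a single execution unit does this cycle.
 * Field widths (8-bit opcode, 8-bit register numbers) are invented. */
typedef struct {
    uint8_t opcode;            /* NOP when the compiler found no work */
    uint8_t dest, src1, src2;
} OpSlot;

/* A very long instruction word for a hypothetical 4-unit VLIW machine:
 * all four operations issue together and execute in lock step. */
typedef struct {
    OpSlot int_alu0;           /* integer unit 0      */
    OpSlot int_alu1;           /* integer unit 1      */
    OpSlot fp_unit;            /* floating-point unit */
    OpSlot load_store;         /* memory unit         */
} VliwWord;                    /* 4 slots x 32 bits = 128 bits */

int main(void) {
    printf("VLIW word length: %zu bits\n", sizeof(VliwWord) * 8);
    return 0;
}
```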
• Instruction scheduling:
  – Superscalar:
    • Done dynamically at run-time by the hardware.
    • Data dependency is checked and resolved in hardware.
    • Needs a look-ahead hardware window for instruction fetch.
  – VLIW:
    • Static scheduling done at compile-time by the compiler.
    • Advantages:
      – Reduced hardware complexity.
      – Tasks such as decoding, data-dependency detection, and instruction issue become simple.
      – Potentially higher clock rate.
      – Higher degree of parallelism, thanks to the global program information available to the compiler.