
Organization of Multiprocessor Systems

 Flynn’s Classification
 Proposed by researcher Michael J. Flynn in 1966.
 It is the most commonly accepted taxonomy of computer organization.
 In this classification, computers are classified by whether they process a single instruction at a time or multiple instructions simultaneously, and whether they operate on one data set or on multiple data sets.

1
Taxonomy of Computer Architectures
Simple Diagrammatic Representation (diagram omitted)

The 4 categories of Flynn’s classification of multiprocessor systems by their instruction and data streams.
2
Single Instruction, Single Data (SISD)
 SISD machines execute a single instruction on individual data values using a single processor.
 Based on the traditional von Neumann uniprocessor architecture, instructions are executed sequentially, or serially, one step after the next.
 Until recently, most computers were of the SISD type.

3
SISD
Simple Diagrammatic Representation

4
Single Instruction, Multiple Data (SIMD)
 An SIMD machine executes a single instruction on multiple data values simultaneously using many processors.
 Since there is only one instruction stream, each processor does not have to fetch and decode each instruction; instead, a single control unit does the fetch and decode for all processors.
 SIMD architectures include array processors.

5
SIMD
Simple Diagrammatic Representation

6
Multiple Instruction, Multiple Data (MIMD)
 MIMD machines are usually referred to as multiprocessors or multicomputers.
 They may execute multiple instructions simultaneously, contrary to SIMD machines.
 Each processor must include its own control unit; the processors can be assigned parts of one task or entirely separate tasks.
 MIMD has two subclasses: shared memory and distributed memory.

7
MIMD
Simple Diagrammatic Representation: Shared Memory and Distributed Memory (diagrams omitted)

8
Multiple Instruction, Single Data (MISD)
 This category has no practical implementations; it was included in the taxonomy for the sake of completeness.

9
Analogy of Flynn’s
Classifications
 An analogy of Flynn’s classification is the
check-in desk at an airport
 SISD: a single desk
 SIMD: many desks and a supervisor with a
megaphone giving instructions that every desk
obeys
 MIMD: many desks working at their own pace,
synchronized through a central database

10
System Topologies
Topologies
 A system may also be classified by its topology.
 A topology is the pattern of connections
between processors.
 The cost-performance tradeoff determines
which topologies to use for a multiprocessor
system.

11
Topology Classification
 A topology is characterized by its diameter, total
bandwidth, and bisection bandwidth
– Diameter – the maximum distance between two
processors in the computer system.
– Total bandwidth – the capacity of a communications
link multiplied by the number of such links in the
system.
– Bisection bandwidth – represents the maximum data
transfer that could occur at the bottleneck in the
topology.
12
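To make these metrics concrete, here is a small sketch (illustrative only; the function names are mine, and the formulas are the standard closed forms for the topologies discussed on the following slides) that computes the diameter and link count; total bandwidth follows by multiplying the link count by the per-link capacity.

```python
def ring(n):                       # n processors connected in a ring
    return {"diameter": n // 2, "links": n}

def mesh(k):                       # k x k two-dimensional mesh, no wraparound
    return {"diameter": 2 * (k - 1), "links": 2 * k * (k - 1)}

def hypercube(d):                  # 2**d processors
    return {"diameter": d, "links": d * 2 ** (d - 1)}

def completely_connected(n):       # every processor linked to every other
    return {"diameter": 1, "links": n * (n - 1) // 2}

print(ring(8))                     # {'diameter': 4, 'links': 8}
print(hypercube(4))                # {'diameter': 4, 'links': 32}
```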
System Topologies
 Shared Bus Topology
– Processors communicate with each other via a single bus that can handle only one data transmission at a time.
– In most shared-bus systems, processors directly communicate with their own local memory; a global memory is also attached to the shared bus.
(Diagram omitted: processors P, each with a local memory M, attached to a shared bus along with a global memory.)
13
System Topologies
 Ring Topology
– Uses direct connections between processors instead of a shared bus.
– Allows multiple communication links to be active simultaneously, but data may have to travel through several processors to reach its destination.
(Diagram omitted: six processors connected in a ring.)

14
System Topologies
 Tree Topology
– Uses direct connections between processors, each having up to three connections (one parent and two children).
– There is only one unique path between any pair of processors.
(Diagram omitted: seven processors arranged as a binary tree.)

15
System Topologies
 Mesh Topology
– In the mesh topology, every processor connects to the processors above and below it, and to the processors on its right and left.
(Diagram omitted: a 3 x 3 mesh of processors.)

16
System Topologies
 Hypercube Topology
– Is a multiple-mesh topology.
– Each processor connects to all other processors whose binary addresses differ by exactly one bit. For example, processor 0 (0000) connects to processor 1 (0001) and processor 2 (0010).
(Diagram omitted: a 16-processor, 4-dimensional hypercube.)

17
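A minimal sketch of that one-bit rule (the function name is mine): a node’s neighbors are found by flipping each address bit in turn.

```python
def hypercube_neighbors(node, dimension):
    """Processors directly connected to `node` in a 2**dimension hypercube:
    those whose binary address differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimension)]

# Processor 0 (0000) in a 16-node hypercube:
print(hypercube_neighbors(0, 4))   # [1, 2, 4, 8] -> 0001, 0010, 0100, 1000
```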
System Topologies
 Completely Connected Topology
 Every processor has n-1 connections, one to each of the other processors.
 Complexity increases as the system grows, but this topology offers maximum communication capability.
(Diagram omitted: eight fully connected processors.)
18
MIMD System Architectures

 Finally, the architecture of a MIMD system, in contrast to its topology, refers to how it connects to its system memory.
 Systems may also be classified by their architecture. Two of these are:
 Uniform memory access (UMA)
 Nonuniform memory access (NUMA)

19
Uniform memory access (UMA)
 A UMA machine is a type of symmetric multiprocessor, or SMP, that has two or more processors performing symmetric functions. UMA gives all CPUs equal (uniform) access to all memory locations in shared memory. The processors interact with shared memory through some communications mechanism, such as a simple bus or a complex multistage interconnection network.
20
Uniform memory access (UMA) Architecture
(Diagram omitted: Processor 1 through Processor n connected to shared memory through a communications mechanism.)

21
Nonuniform memory access (NUMA)
 NUMA architectures, unlike UMA architectures, do not provide uniform access to all shared memory locations. Every processor can still access every shared memory location, but in a nonuniform way: each processor can access its own local shared memory more quickly than the memory modules attached to other processors.
22
Nonuniform memory access (NUMA) Architecture
(Diagram omitted: Processor 1 through Processor n, each with a local Memory 1 through Memory n, joined by a communications mechanism.)

23
What is Pipelining?

• A way of speeding up execution of instructions

• Key idea:
overlap execution of multiple instructions
The Laundry Analogy

• Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and stash
• Washer takes 30 minutes
• Dryer takes 30 minutes
• “Folder” takes 30 minutes
• “Stasher” takes 30 minutes to put clothes into drawers
If we do laundry sequentially...

(Timeline omitted: loads A, B, C, D run back to back from 6 PM to 2 AM, each occupying four 30-minute stages.)
• Time Required: 8 hours for 4 loads
To Pipeline, We Overlap Tasks

(Timeline omitted: loads A, B, C, D start 30 minutes apart, so their stages overlap.)
• Time Required: 3.5 hours for 4 loads
To Pipeline, We Overlap Tasks

• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
Pipelining a Digital System
(1 nanosecond = 10^-9 second; 1 picosecond = 10^-12 second)

• Key idea: break a big computation (1 ns) up into pieces
• Separate each piece with a pipeline register
(Diagram omitted: one 1 ns block split into five 200 ps stages separated by pipeline registers.)
Pipelining a Digital System

• Why do this? Because it's faster for repeated computations

Non-pipelined: 1 operation finishes every 1 ns
Pipelined: 1 operation finishes every 200 ps


Comments about pipelining

• Pipelining increases throughput, but not


latency
– Answer available every 200ps, BUT
– A single computation still takes 1ns
• Limitations:
– Computations must be divisible into stage size
– Pipeline registers add overhead
Pipelining a Processor

• Recall the 5 steps in instruction execution:


1. Instruction Fetch (IF)
2. Instruction Decode and Register Read (ID)
3. Execution operation or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)

• Review: Single-Cycle Processor


– All 5 steps done in a single clock cycle
– Dedicated hardware required for each step
Review - Single-Cycle Processor

•What do we need to add to actually split the datapath into stages?


The Basic Pipeline For MIPS

(Diagram omitted: across cycles 1-7, successive instructions flow through the Ifetch, Reg, ALU, DMem, and Reg stages, each instruction offset by one cycle from the one before it.)

What do we need to add to actually split the datapath into stages?

Basic Pipelined Processor
Single-Cycle vs. Pipelined Execution

Non-Pipelined:
(Diagram omitted.) The instructions lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) each occupy the full datapath in turn (Instruction Fetch, REG RD, ALU, MEM, REG WR). A new instruction starts every 800 ps.

Pipelined:
(Diagram omitted.) The same three lw instructions overlap, each offset by one stage. A new instruction starts every 200 ps, with each stage taking 200 ps.
Speedup
• Consider the unpipelined processor introduced previously. Assume that
it has a 1 ns clock cycle and it uses 4 cycles for ALU operations and
branches, and 5 cycles for memory operations, assume that the relative
frequencies of these operations are 40%, 20%, and 40%, respectively.
Suppose that due to clock skew and setup, pipelining the processor
adds 0.2ns of overhead to the clock. Ignoring any latency impact, how
much speedup in the instruction execution rate will we gain from a
pipeline?

Average instruction execution time
= 1 ns × ((40% + 20%) × 4 + 40% × 5)
= 4.4 ns

Speedup from pipelining
= Average instruction time unpipelined / Average instruction time pipelined
= 4.4 ns / (1 ns + 0.2 ns) = 3.7
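A quick sketch of the same calculation, using the values stated in the example above:

```python
# Unpipelined: 1 ns clock; ALU ops (40%) and branches (20%) take 4 cycles,
# memory operations (40%) take 5 cycles.
clock = 1.0                                                # ns
avg_unpipelined = clock * ((0.40 + 0.20) * 4 + 0.40 * 5)   # 4.4 ns

# Pipelined: one instruction per cycle, but clock skew/setup add 0.2 ns.
avg_pipelined = clock + 0.2                                # 1.2 ns

print(avg_unpipelined / avg_pipelined)                     # ~3.67, i.e. about 3.7x
```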
Comments about Pipelining

• The good news


– Multiple instructions are being processed at same time
– This works because stages are isolated by registers
– Best case speedup of N
• The bad news
– Instructions interfere with each other - hazards
» Example: different instructions may need the same
piece of hardware (e.g., memory) in same clock cycle
» Example: instruction may require a result produced
by an earlier instruction that is not yet complete
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

Passed to next stage:
IR <- Mem[PC]
NPC <- PC + 4

Instruction Fetch (IF):
Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction.
IR holds the instruction that will be used in the next stage.
NPC holds the value of the next PC.

Appendix A - Pipelining 39
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

Passed to next stage:
A <- Regs[IR6..IR10]
B <- Regs[IR11..IR15]
Imm <- sign-extend(IR16..IR31)

Instruction Decode / Register Fetch Cycle (ID):
Decode the instruction and access the register file to read the registers.
The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles.
We extend the sign of the lower 16 bits of the instruction register into Imm.

Appendix A - Pipelining 40
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

Passed to next stage:
ALU output <- A func B
cond <- 0

Execute / Address Calculation (EX):
We perform an operation (for an ALU instruction) or an address calculation (if it is a load or a branch).
If an ALU instruction, actually do the operation. If an address calculation, figure out how to obtain the address and stash away the location of that address for the next cycle.
Appendix A - Pipelining 41
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

Passed to next stage:
LMD <- Mem[ALU output]   (load)
or
Mem[ALU output] <- B     (store)

Memory Access (MEM):
If this is an ALU instruction, do nothing.
If a load or store, then access memory.

Appendix A - Pipelining 42
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

Passed to next stage:
Regs <- ALU output or LMD

Write Back (WB):
Update the register file from either the ALU output or from the data loaded from memory (LMD).

Appendix A - Pipelining 43
Pipeline Hazards
• Limits to pipelining: Hazards prevent
next instruction from executing during
its designated clock cycle
– Structural hazards: two different instructions use same h/w in
same cycle
– Data hazards: Instruction depends on result of prior instruction
still in the pipeline
– Control hazards: Pipelining of branches & other instructions that
change the PC
Summary - Pipelining Overview
• Pipelining increases throughput (but not latency)
• Hazards limit performance
– Structural hazards
– Control hazards
– Data hazards
Pipeline Hurdles
Definition
• conditions that lead to incorrect behavior if not fixed
• Structural hazard
– two different instructions use same h/w in same cycle
• Data hazard
– two different instructions use same storage
– must appear as if the instructions execute in correct order
• Control hazard
– one instruction affects which instruction is next

Resolution
• Pipeline interlock logic detects hazards and fixes them
• Simple solution: stall
  • increases CPI, decreases performance
• Better solution: partial stall
  • some instructions stall, others proceed; better to stall early than late

Appendix A - Pipelining 46
Structural Hazards

A structural hazard occurs when two or more different instructions want to use the same hardware resource in the same cycle; e.g., MEM uses the same memory port as IF, as shown in this slide.

(Diagram omitted, Figure 3.6: a Load followed by Instr 1-4, each passing through Ifetch, Reg, ALU, DMem, Reg one cycle behind the previous instruction, so the Load's DMem access overlaps a later instruction's Ifetch.)
Appendix A - Pipelining 47
Structural Hazards

This is another way of looking at the effect of a stall.

(Diagram omitted, Figure 3.7: the Load and Instr 1-2 proceed normally; a stall inserts a bubble into every stage for one cycle before Instr 3 continues.)

Appendix A - Pipelining 48
Structural Hazards

This is another way to represent the stall we saw on the last few pages.
Appendix A - Pipelining 49
Structural Hazards
Dealing with Structural Hazards
Stall
• low cost, simple
• Increases CPI
• use for rare case since stalling has performance effect
Pipeline hardware resource
• useful for multi-cycle resources
• good performance
• sometimes complex e.g., RAM
Replicate resource
• good performance
• increases cost (+ maybe interconnect delay)
• useful for cheap or divisible resources

Appendix A - Pipelining 50
Structural Hazards
Structural hazards are reduced with these rules:
• Each instruction uses a resource at most once
• Always use the resource in the same pipeline stage
• Use the resource for one cycle only
Many RISC ISAs are designed with this in mind.
Sometimes it is very complex to do this. For example, memory is of necessity used in both the IF and MEM stages.

Some common Structural Hazards:


• Memory - we’ve already mentioned this one.
• Floating point - Since many floating point instructions require many
cycles, it’s easy for them to interfere with each other.
• Starting up more of one type of instruction than there are resources.
For instance, the PA-8600 can support two ALU + two load/store
instructions per cycle - that’s how much hardware it has available.

Appendix A - Pipelining 51
Structural Hazards
This is the example on Page 144.

We want to compare the performance of two machines. Which machine is faster?


• Machine A: Dual ported memory - so there are no memory stalls
• Machine B: Single ported memory, but its pipelined implementation has a 1.05
times faster clock rate
Assume:
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

SpeedupA = Pipeline Depth / (1 + 0) × (clock_unpipelined / clock_pipelined)
         = Pipeline Depth
SpeedupB = Pipeline Depth / (1 + 0.4 × 1) × (clock_unpipelined / (clock_unpipelined / 1.05))
         = (Pipeline Depth / 1.4) × 1.05
         = 0.75 × Pipeline Depth
SpeedupA / SpeedupB = Pipeline Depth / (0.75 × Pipeline Depth) = 1.33

• Machine A is 1.33 times faster

Appendix A - Pipelining 52
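A sketch of the same comparison; the 40% load frequency, the one-cycle stall per load on the single-ported machine, and the 1.05× clock factor are the example's assumptions.

```python
pipeline_depth = 5            # any depth works; it cancels out of the final ratio

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_depth / (1 + 0.0)

# Machine B: single-ported memory, 40% loads stalling 1 cycle each,
# but a 1.05x faster clock.
speedup_b = pipeline_depth / (1 + 0.4 * 1) * 1.05

print(speedup_a / speedup_b)  # ~1.33: Machine A is about 1.33 times faster
```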
Data Hazards
A.1 What is Pipelining?
A.2 The Major Hurdle of Pipelining - Structural Hazards
 – Structural Hazards
 – Data Hazards
 – Control Hazards
A.3 How is Pipelining Implemented?
A.4 What Makes Pipelining Hard to Implement?
A.5 Extending the MIPS Pipeline to Handle Multi-cycle Operations

Data hazards occur when, at any time, there are instructions active that need to access the same data (memory or register) locations.

Where there’s real trouble is when we have:

instruction A
instruction B

and B manipulates (reads or writes) data before A does. This violates the order of the instructions, since the architecture implies that A completes entirely before B is executed.
Appendix A - Pipelining 53
Data Hazards
Read After Write (RAW)
Execution order is: InstrI, then InstrJ. InstrJ tries to read an operand before InstrI writes it.

I: add r1,r2,r3
J: sub r4,r1,r3

• Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Appendix A - Pipelining 54
Data Hazards
Write After Read (WAR)
Execution order is: InstrI, then InstrJ. InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand.

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

– Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

• Can’t happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Appendix A - Pipelining 55
Data Hazards
Write After Write (WAW)
Execution order is: InstrI, then InstrJ. InstrJ tries to write an operand before InstrI writes it, leaving the wrong result (InstrI’s value, not InstrJ’s).

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.

• Can’t happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5

• Will see WAR and WAW in later, more complicated pipes
Appendix A - Pipelining 56
Data Hazards
Simple Solution to RAW

• Hardware detects RAW and stalls
• Assumes register written, then read, each cycle
+ low cost to implement, simple
-- reduces IPC
• Try to minimize stalls

Minimizing RAW stalls

• Bypass/forward/short-circuit (we will use the word “forward”)
• Use data before it is in the register
+ reduces/avoids stalls
-- complex
• Crucial for common RAW hazards
Appendix A - Pipelining 57
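As a sketch of what "hardware detects RAW" and forwarding mean in practice, here is the standard forwarding check for a 5-stage pipeline written as Python pseudocode; the pipeline-register field names such as ex_mem_rd are my own labels, in the spirit of the usual MIPS forwarding unit.

```python
def forward_source(id_ex_rs, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Decide where an ALU operand read as register `id_ex_rs` should come from."""
    # EX hazard: the instruction one ahead will write the register we want to read.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"          # forward the ALU result computed last cycle
    # MEM hazard: the instruction two ahead writes the register we want to read.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"          # forward the value about to be written back
    return "REGISTER FILE"       # no hazard: use the value read during ID
```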
Data Hazards

(Diagram omitted, Figure 3.9.) Instruction sequence:

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.
Appendix A - Pipelining 58
Data Hazards
Forwarding To Avoid Data Hazard

Forwarding is the concept of making data available to the input of the ALU for subsequent instructions, even though the generating instruction hasn’t gotten to WB in order to write the memory or registers.

(Diagram omitted, Figure 3.10: the same add/sub/and/or/xor sequence, with the add’s ALU result forwarded directly to the ALU inputs of the following instructions.)
Appendix A - Pipelining 59
Data Hazards

The data isn’t loaded until after the MEM stage.

(Diagram omitted, Figure 3.12.) Instruction sequence:

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

There are some instances where hazards occur, even with forwarding.
Appendix A - Pipelining 60
Data Hazards

The stall is necessary, as shown here.

(Diagram omitted, Figure 3.13: the same lw/sub/and/or sequence, with a one-cycle bubble inserted so that sub does not need r1 until after the load’s MEM stage.)

There are some instances where hazards occur, even with forwarding.
Appendix A - Pipelining 61
Data Hazards

This is another representation of the stall.

Without the stall:
LW  R1, 0(R2)    IF ID EX MEM WB
SUB R4, R1, R5      IF ID EX  MEM WB
AND R6, R1, R7         IF ID  EX  MEM WB
OR  R8, R1, R9            IF  ID  EX  MEM WB

With the stall:
LW  R1, 0(R2)    IF ID EX MEM WB
SUB R4, R1, R5      IF ID stall EX  MEM WB
AND R6, R1, R7         IF stall ID  EX  MEM WB
OR  R8, R1, R9            stall IF  ID  EX  MEM WB
Appendix A - Pipelining 62
Data Hazards: Pipeline Scheduling

Instructions are scheduled by the compiler: it moves instructions in order to reduce stalls.

lw  Rb, b          # code sequence for a = b + c before scheduling
lw  Rc, c
Add Ra, Rb, Rc     # stall
sw  a, Ra
lw  Re, e          # code sequence for d = e + f before scheduling
lw  Rf, f
sub Rd, Re, Rf     # stall
sw  d, Rd

Arrangement of the code after scheduling:

lw  Rb, b
lw  Rc, c
lw  Re, e
Add Ra, Rb, Rc
lw  Rf, f
sw  a, Ra
sub Rd, Re, Rf
sw  d, Rd
Appendix A - Pipelining 63
Data Hazards: Pipeline Scheduling

(Chart omitted.) Percentage of loads that stall the pipeline, scheduled vs. unscheduled code:

gcc:   31% scheduled, 54% unscheduled
spice: 14% scheduled, 42% unscheduled
tex:   25% scheduled, 65% unscheduled
Appendix A - Pipelining 64
Control Hazards
A.1 What is Pipelining?
A.2 The Major Hurdle of Pipelining - Structural Hazards
 – Structural Hazards
 – Data Hazards
 – Control Hazards
A.3 How is Pipelining Implemented?
A.4 What Makes Pipelining Hard to Implement?
A.5 Extending the MIPS Pipeline to Handle Multi-cycle Operations

A control hazard occurs when we need to find the destination of a branch, and can’t fetch any new instructions until we know that destination.
Appendix A - Pipelining 65
Control Hazards
Control Hazard on Branches: Three Stage Stall

(Diagram omitted.) Instruction sequence:

10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

Until the branch at address 10 resolves, the pipeline does not know whether to fetch the fall-through instructions (14, 18, 22) or the target (36), which costs three stall cycles.
Appendix A - Pipelining 66
Control Hazards Branch Stall Impact
• If CPI = 1, 30% of instructions are branches, and each stalls 3 cycles => new CPI = 1 + 0.30 × 3 = 1.9!
• Two-part solution to this dramatic increase:
– Determine branch taken or not sooner, AND
– Compute the taken-branch address earlier

• MIPS branch tests whether a register = 0 or ≠ 0

• MIPS Solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
• must be fast
• can't afford to subtract
• compares with 0 are simple
• greater-than and less-than test the sign bit, but not-equal must OR all bits
• more general compares need the ALU
– 1 clock cycle penalty for a branch versus 3

In the next chapter, we’ll look at ways to avoid the branch altogether.

Appendix A - Pipelining 67
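The 1.9 follows directly from the stall formula; a quick check using the slide's assumed 30% branch frequency and 3-cycle penalty:

```python
base_cpi = 1.0
branch_freq = 0.30
branch_penalty = 3                 # cycles lost per branch while the pipeline stalls

new_cpi = base_cpi + branch_freq * branch_penalty
print(new_cpi)                     # 1.9
```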
Control Hazards Five Branch Hazard
Alternatives
#1: Stall until branch direction is clear

#2: Predict Branch Not Taken


– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken


– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
• MIPS still incurs 1 cycle branch penalty
• Other machines: branch target known before outcome

Appendix A - Pipelining 68
Control Hazards Five Branch Hazard
Alternatives
#4: Execute Both Paths

#5: Delayed Branch


– Define branch to take place AFTER a following instruction

branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n      <- branch delay of length n
branch target if taken

– 1 slot delay allows proper decision and branch target address in 5


stage pipeline
– MIPS uses this

Appendix A - Pipelining 69
Control Hazards Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled

• Compiler effectiveness for single branch delay slot:


– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in
computation
– About 50% (60% x 80%) of slots usefully filled

• Delayed Branch downside: 7-8 stage pipelines, multiple instructions


issued per clock (superscalar)

Appendix A - Pipelining 70
Control Hazards Evaluating Branch Alternatives

Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline       3                1.42   3.5                       1.0
Predict taken        1                1.14   4.4                       1.26
Predict not taken    1                1.09   4.5                       1.29
Delayed branch       0.5              1.07   4.6                       1.31

Conditional & unconditional branches = 14% of instructions; 65% of branches change the PC.

Appendix A - Pipelining 71
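A sketch reproducing the CPI column from the formula above. The 14% branch frequency and the effective penalties are the table's assumptions (for predict-not-taken, the 1-cycle penalty applies only to the roughly 65% of branches that are taken); small rounding differences against the printed speedup columns are expected.

```python
branch_freq = 0.14
pipeline_depth = 5
effective_penalty = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 0.65 * 1.0,   # penalty paid only when the branch is taken
    "Delayed branch":    0.5,
}

for scheme, penalty in effective_penalty.items():
    cpi = 1 + branch_freq * penalty
    print(f"{scheme:18s} CPI = {cpi:.2f}  speedup vs. unpipelined = {pipeline_depth / cpi:.2f}")
# CPI: 1.42, 1.14, 1.09, 1.07 - matching the table above.
```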
Control Hazards Pipelining Introduction Summary

• Just overlap tasks; this is easy if the tasks are independent
• Speedup ≤ Pipeline Depth; if the ideal CPI is 1, then:

Speedup = (Pipeline Depth / (1 + Pipeline stall CPI)) × (Clock Cycle Unpipelined / Clock Cycle Pipelined)

• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW, WAR, WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction

Appendix A - Pipelining 72
Control Hazards Compiler “Static” Prediction of Taken/Untaken Branches

The compiler can program what it thinks the branch direction will be. Here are the results when it does so.

(Charts omitted: frequency of misprediction and misprediction rate for the benchmarks gcc, espresso, ora, doduc, compress, alvinn, tomcatv, hydro2d, mdljsp2, and swm256, comparing an “always taken” strategy with “taken backwards / not taken forwards”.)
Appendix A - Pipelining 73
Control Hazards Compiler “Static”
Prediction of
Taken/Untaken Branches

• Improves strategy for placing instructions in delay slot

• Two strategies
– Backward branch predict taken, forward branch not taken
– Profile-based prediction: record branch behavior, predict branch
based on prior run

Appendix A - Pipelining 74
Control Hazards Evaluating Static Branch Prediction Strategies

• Misprediction rate alone ignores the frequency of branches
• “Instructions between mispredicted branches” is a better metric

(Chart omitted: instructions per mispredicted branch, on a logarithmic scale from 10 to 100,000, for gcc, espresso, ora, doduc, compress, alvinn, tomcatv, hydro2d, mdljsp2, and swm256, comparing profile-based and direction-based prediction.)
Appendix A - Pipelining 75
An example execution sequence
 Here’s a sample sequence of instructions to execute (addresses in decimal):

1000: lw  $8, 4($29)
1004: sub $2, $4, $5
1008: and $9, $10, $11
1012: or  $16, $17, $18
1016: add $13, $14, $0

 We’ll make some assumptions, just so we can show actual data values.
— Each register contains its number plus 100. For instance, register $8
contains 108, register $29 contains 129, and so forth.
— Every data memory location contains 99.
 Our pipeline diagrams will follow some conventions.
— An X indicates values that aren’t important, like the constant field of
an R-type instruction.
— Question marks ??? indicate values we don’t know, usually resulting
from instructions coming before and after the ones in our example.

76
Cycle 1 (filling)
IF: lw $8, 4($29)   ID: ???   EX: ???   MEM: ???   WB: ???
(Datapath diagram omitted.)
77

Cycle 2
IF: sub $2, $4, $5   ID: lw $8, 4($29)   EX: ???   MEM: ???   WB: ???
(Datapath diagram omitted.)
78

Cycle 3
IF: and $9, $10, $11   ID: sub $2, $4, $5   EX: lw $8, 4($29)   MEM: ???   WB: ???
(Datapath diagram omitted.)
79

Cycle 4
IF: or $16, $17, $18   ID: and $9, $10, $11   EX: sub $2, $4, $5   MEM: lw $8, 4($29)   WB: ???
(Datapath diagram omitted.)
80

Cycle 5 (full)
IF: add $13, $14, $0   ID: or $16, $17, $18   EX: and $9, $10, $11   MEM: sub $2, $4, $5   WB: lw $8, 4($29)
(Datapath diagram omitted.)
81

Cycle 6 (emptying)
IF: ???   ID: add $13, $14, $0   EX: or $16, $17, $18   MEM: and $9, $10, $11   WB: sub $2, $4, $5
(Datapath diagram omitted.)
82

Cycle 7
IF: ???   ID: ???   EX: add $13, $14, $0   MEM: or $16, $17, $18   WB: and $9, $10, $11
(Datapath diagram omitted.)
83

Cycle 8
IF: ???   ID: ???   EX: ???   MEM: add $13, $14, $0   WB: or $16, $17, $18
(Datapath diagram omitted.)
84

Cycle 9
IF: ???   ID: ???   EX: ???   MEM: ???   WB: add $13, $14, $0
(Datapath diagram omitted.)
85
That’s a lot of diagrams there
Clock cycle
1 2 3 4 5 6 7 8 9
lw $t0, 4($sp) IF ID EX MEM WB
sub $v0, $a0, $a1 IF ID EX MEM WB
and $t1, $t2, $t3 IF ID EX MEM WB
or $s0, $s1, $s2 IF ID EX MEM WB
add $t5, $t6, $0 IF ID EX MEM WB

 Compare the last nine slides with the pipeline diagram above.
— You can see how instruction executions are overlapped.
— Each functional unit is used by a different instruction in each cycle.
— The pipeline registers save control and data values generated in
previous clock cycles for later use.
— When the pipeline is full in clock cycle 5, all of the hardware units are
utilized. This is the ideal situation, and what makes pipelined
processors so fast.
 Try to understand this example or the similar one in the book at the end
of Section 6.3.

86
Performance Revisited

 Assuming the following functional unit latencies:
   Inst mem: 3 ns   Reg read: 2 ns   ALU: 2 ns   Data mem: 3 ns   Reg write: 2 ns

 What is the cycle time of a single-cycle implementation?
— What is its throughput?

 What is the cycle time of an ideal pipelined implementation?
— What is its steady-state throughput?

 How much faster is pipelining?

87
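A sketch of the arithmetic these questions call for, assuming the single-cycle clock must cover all five units in series and the pipelined clock must cover only the slowest stage:

```python
latencies = {"Inst mem": 3, "Reg read": 2, "ALU": 2, "Data mem": 3, "Reg write": 2}  # ns

single_cycle_time = sum(latencies.values())      # 12 ns per instruction
pipelined_cycle_time = max(latencies.values())   # 3 ns, set by the slowest stage

print(single_cycle_time)                          # 12 -> throughput of 1/12 instruction per ns
print(pipelined_cycle_time)                       # 3  -> steady-state 1/3 instruction per ns
print(single_cycle_time / pipelined_cycle_time)   # 4.0x faster, ignoring hazards and fill/drain
```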
