Organization of Multiprocessor Systems
Flynn’s Classification
Proposed by Michael J. Flynn in 1966, it is the most
commonly accepted taxonomy of computer organization.
In this classification, computers are classified by
whether they process a single instruction at a time or
multiple instructions simultaneously, and whether they
operate on one or multiple data sets.
Taxonomy of Computer Architectures
[Figure: simple diagrammatic representation of the four classes]
Single Instruction, Single Data (SISD)
An SISD machine is the conventional sequential computer:
a single control unit fetches one instruction at a time,
and each instruction operates on one data set.
[Figure: simple diagrammatic representation]
Single Instruction, Multiple Data (SIMD)
An SIMD machine executes a single instruction
on multiple data values simultaneously using
many processors.
Since there is only one instruction stream, the
processors do not each fetch and decode every
instruction. Instead, a single control unit does the
fetching and decoding for all processors.
SIMD architectures include array processors.
SIMD
[Figure: simple diagrammatic representation]
Multiple Instruction, Multiple Data (MIMD)
MIMD machines are usually referred to as
multiprocessors or multicomputers.
They may execute multiple instructions
simultaneously, unlike SIMD machines.
Each processor includes its own control unit; the
processors can be assigned parts of one task or
entirely separate tasks.
MIMD has two subclasses: shared memory and
distributed memory.
MIMD
[Figures: simple diagrammatic representations — shared memory and distributed memory]
Multiple Instruction, Single Data (MISD)
This category has no widely accepted practical
examples; it was included in the taxonomy for
the sake of completeness.
Analogy of Flynn’s Classification
An analogy for Flynn’s classification is the
check-in desk at an airport:
– SISD: a single desk
– SIMD: many desks and a supervisor with a
megaphone giving instructions that every desk
obeys
– MIMD: many desks working at their own pace,
synchronized through a central database
System Topologies
Topologies
A system may also be classified by its topology.
A topology is the pattern of connections
between processors.
The cost-performance tradeoff determines
which topologies to use for a multiprocessor
system.
Topology Classification
A topology is characterized by its diameter, total
bandwidth, and bisection bandwidth:
– Diameter – the maximum distance between any two
processors in the computer system.
– Total bandwidth – the capacity of a communications
link multiplied by the number of such links in the
system.
– Bisection bandwidth – the maximum data transfer
that could occur at the bottleneck of the topology,
i.e., across a cut that divides the system into two halves.
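For a concrete feel for the diameter metric, here is a small sketch (Python; not from the slides, and the helper names are ours) that computes a topology's diameter from its link list via breadth-first search:

```python
# Sketch: compute the diameter of a topology from its links by BFS.
from collections import deque

def ring_links(n):
    """Links of a ring of n processors: each connects to its two neighbors."""
    return [(i, (i + 1) % n) for i in range(n)]

def diameter(n, links):
    """Maximum shortest-path distance between any two processors."""
    adj = [[] for _ in range(n)]
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}            # BFS from each source processor
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

# In an 8-processor ring the farthest pair is 4 hops apart.
print(diameter(8, ring_links(8)))   # 4
```

The same `diameter` helper works for any of the topologies on the following slides once their link lists are written out.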
System Topologies
Shared Bus Topology
– Processors communicate with each other via a
single bus that can handle only one data
transmission at a time.
– In most shared-bus systems, processors
communicate directly with their own local memory;
global memory modules are attached to the shared bus.
[Figure: processors P on a shared bus with global memory modules M]
System Topologies
Ring Topology
– Uses direct connections between processors
instead of a shared bus.
– Allows multiple communication links to be active
simultaneously, but data may have to travel
through several processors to reach its destination.
[Figure: six processors connected in a ring]
System Topologies
Tree Topology
– Uses direct connections between processors; each
processor has up to three connections (its parent
and two children).
– There is only one unique path between any pair
of processors.
[Figure: seven processors in a binary tree]
System Topologies
Mesh Topology
– In the mesh topology, every processor connects
to the processors above and below it, and to its
right and left.
[Figure: 3×3 mesh of processors]
System Topologies
Hypercube Topology
– A multidimensional mesh topology.
– Each processor connects to all other processors
whose binary labels differ from its own by exactly
one bit. For example, processor 0 (0000) connects
to 1 (0001) and 2 (0010), among others.
[Figure: 16 processors in a 4-dimensional hypercube]
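The one-bit-difference rule makes a processor's neighbors easy to compute. A small sketch (Python; not from the slides, function name is ours):

```python
def hypercube_neighbors(p, dim):
    """Neighbors of processor p in a dim-dimensional hypercube:
    all labels that differ from p in exactly one bit position."""
    return sorted(p ^ (1 << b) for b in range(dim))

# Processor 0 (0000) in a 4-D hypercube connects to 1, 2, 4, and 8.
print(hypercube_neighbors(0, 4))   # [1, 2, 4, 8]
```

Each processor therefore has exactly dim = log2(n) links, which is why the hypercube's diameter also grows only logarithmically with the number of processors.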
System Topologies
Completely Connected Topology
– Every processor has n−1 connections, one to each
of the other processors.
– Complexity increases as the system grows, but
this topology offers maximum communication
capability.
[Figure: eight fully interconnected processors]
MIMD System Architectures
Uniform Memory Access (UMA)
A UMA machine is a type of symmetric
multiprocessor, or SMP: it has two or more
processors that perform symmetric functions, and
it gives all CPUs equal (uniform) access to all
memory locations in shared memory. The processors
reach shared memory through some communications
mechanism, anywhere from a simple bus to a complex
multistage interconnection network.
Uniform Memory Access (UMA) Architecture
[Figure: processors 1..n connected through a communications mechanism to shared memory]
Nonuniform Memory Access (NUMA)
NUMA architectures, unlike UMA architectures, do
not provide uniform access to all shared memory
locations. Every processor can still access every
shared memory location, but access times are
nonuniform: each processor reaches its local
shared memory more quickly than the memory
modules attached to other processors.
Nonuniform Memory Access (NUMA) Architecture
[Figure: processors with local memory modules joined by a communications mechanism]
What is Pipelining?
• Key idea: overlap execution of multiple instructions

The Laundry Analogy
[Figure: four loads A–D done sequentially from 6 PM to 2 AM,
each load taking four 30-minute stages]
• Time required: 8 hours for 4 loads
To Pipeline, We Overlap Tasks
[Figure: the same four loads with their stages overlapped,
finishing by 9:30 PM]
• Time required: 3.5 hours for 4 loads
To Pipeline, We Overlap Tasks
• Pipelining doesn’t help the latency of a single task;
it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it
reduce speedup
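The analogy's arithmetic can be checked with a short sketch (Python; not from the slides, and the helper name is ours):

```python
def laundry_time(loads, stages, stage_minutes, pipelined):
    """Total minutes for the laundry analogy.
    Sequential: every stage of every load runs back to back.
    Pipelined: loads overlap; after the pipeline fills, one load
    finishes per stage time."""
    if pipelined:
        return (stages + loads - 1) * stage_minutes
    return loads * stages * stage_minutes

print(laundry_time(4, 4, 30, pipelined=False) / 60)   # 8.0 hours
print(laundry_time(4, 4, 30, pipelined=True) / 60)    # 3.5 hours
```

Note the pipelined time is not 8/4 = 2 hours: the (stages − 1) fill/drain term is exactly the "time to fill and drain" overhead listed above.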
Pipelining a Digital System
1 nanosecond = 10^-9 second; 1 picosecond = 10^-12 second
• Separate each piece of the logic with a pipeline register
[Figure: a 1 ns combinational block split into stages by pipeline registers]
Pipelining a Digital System
Non-pipelined: 1 operation finishes every 1 ns
Pipelined: 1 operation finishes every 200 ps
[Figure: instructions flowing through the Ifetch, Reg, ALU, DMem, Reg
stages, with a new instruction entering each cycle]
Non-Pipelined
[Figure: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) executed back
to back — each instruction takes 800 ps (Instruction Fetch, REG RD,
ALU, MEM, REG WR) before the next one begins]
Pipelined
[Figure: the same three lw instructions overlapped — each stage takes
200 ps, and a new instruction starts every 200 ps]
Speedup
• Consider the unpipelined processor introduced previously. Assume it
has a 1 ns clock cycle and uses 4 cycles for ALU operations and
branches and 5 cycles for memory operations; assume the relative
frequencies of these operations are 40%, 20%, and 40%, respectively.
Suppose that, due to clock skew and setup, pipelining the processor
adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how
much speedup in the instruction execution rate will we gain from a
pipeline?
Appendix A - Pipelining 39
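One way to work this example's numbers (our sketch; hazard stalls are ignored, as the question says):

```python
# Operation mix: ALU (40%) and branches (20%) take 4 cycles; memory
# operations (40%) take 5 cycles.
freq   = {"alu_branch": 0.40 + 0.20, "memory": 0.40}
cycles = {"alu_branch": 4, "memory": 5}
clock_ns, overhead_ns = 1.0, 0.2

# Unpipelined: average instruction time = average CPI x clock cycle.
avg_time = clock_ns * sum(freq[k] * cycles[k] for k in freq)   # 4.4 ns

# Pipelined: ideally one instruction per clock, but the clock is
# stretched by the 0.2 ns of skew/setup overhead.
pipelined_time = clock_ns + overhead_ns                        # 1.2 ns

print(round(avg_time / pipelined_time, 2))   # 3.67
```

So the pipeline speeds up instruction execution by 4.4 ns / 1.2 ns ≈ 3.7×, noticeably less than the 5× that a five-stage pipeline might suggest, because of the clock overhead.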
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
Passed to the next stage (register fetch during ID):
A <- Regs[IR6..IR10];
B <- Regs[IR11..IR15];
Imm <- ((IR16)^16 ## IR16..31)   (sign-extended immediate)
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
Passed to the next stage (ALU execution during EX):
ALUOutput <- A func B;
cond <- 0;
What Is Pipelining: MIPS Functions
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
Passed to the next stage (write-back during WB):
Regs <- ALUOutput (or the load memory data, LMD, for a load)
Pipeline Hazards
• Limits to pipelining: hazards prevent the next
instruction from executing during its designated
clock cycle
– Structural hazards: two different instructions use the same h/w in
the same cycle
– Data hazards: an instruction depends on the result of a prior
instruction still in the pipeline
– Control hazards: pipelining of branches and other instructions that
change the PC
Summary - Pipelining Overview
• Pipelining increases throughput (but not latency)
• Hazards limit performance
– Structural hazards
– Control hazards
– Data hazards
Pipeline Hurdles
Definition
• Hazards: conditions that lead to incorrect behavior if not fixed
• Structural hazard
– two different instructions use the same h/w in the same cycle
• Data hazard
– two different instructions use the same storage
– the result must appear as if the instructions execute in the correct order
• Control hazard
– one instruction affects which instruction is next
Resolution
• Pipeline interlock logic detects hazards and fixes them
• Simple solution: stall — increases CPI, decreases performance
• Better solution: partial stall — some instructions stall while others
proceed; it is better to stall early than late
Structural Hazards
When two or more different instructions want to use the same hardware
resource in the same cycle — e.g., MEM uses the same memory port as
IF, as shown in this slide.
[Figure 3.6: a Load followed by Instr 1–4; in cycle 4, the Load’s DMem
access conflicts with Instr 3’s Ifetch on the shared memory port]
Structural Hazards
This is another way of looking at the effect of a stall.
[Figure 3.7: the same sequence redrawn — Instr 3 is stalled for one
cycle, and a bubble flows down the pipeline in its place]
Structural Hazards
Dealing with Structural Hazards
Stall
• low cost, simple
• increases CPI
• use for rare cases, since stalling hurts performance
Pipeline the hardware resource
• useful for multi-cycle resources
• good performance
• sometimes complex, e.g., RAM
Replicate the resource
• good performance
• increases cost (and maybe interconnect delay)
• useful for cheap or divisible resources
Structural Hazards
Structural hazards are reduced with these rules:
• Each instruction uses a resource at most once
• Always use the resource in the same pipeline stage
• Use the resource for one cycle only
Many RISC ISAs were designed with this in mind.
Sometimes it is very complex to do this; for example, memory of
necessity is used in both the IF and MEM stages.
Structural Hazards
This is the example on Page 144.
Data Hazards
A.1 What is Pipelining?
A.2 The Major Hurdle of Pipelining – Structural Hazards
– Data Hazards
– Control Hazards
A.3 How is Pipelining Implemented?
A.4 What Makes Pipelining Hard to Implement?
A.5 Extending the MIPS Pipeline to Handle Multi-cycle Operations

Data hazards occur when, at any time, there are instructions active
that need to access the same data (memory or register) locations.
Where there’s real trouble is when we have:
instruction A
instruction B
with B depending on A’s result.
Data Hazards
Read After Write (RAW)
InstrJ tries to read an operand before InstrI writes it
(execution order is InstrI, then InstrJ):
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “dependence” (in compiler nomenclature).
This hazard results from an actual need for communication.
Data Hazards
Write After Read (WAR)
InstrJ tries to write an operand before InstrI reads it
(execution order is InstrI, then InstrJ):
– InstrI gets the wrong operand
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Data Hazards
Write After Write (WAW)
InstrJ tries to write an operand before InstrI writes it
(execution order is InstrI, then InstrJ):
– Leaves the wrong result (InstrI’s, not InstrJ’s)
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
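All three hazard types can be detected mechanically from each instruction's destination and source registers. A small sketch (Python; the helper and the instruction encoding are ours, not from the slides):

```python
def hazards(instr_i, instr_j):
    """Classify hazards when instr_j follows instr_i in program order.
    Each instruction is (dest, sources); dest may be None (e.g. a store)."""
    di, si = instr_i
    dj, sj = instr_j
    found = []
    if di is not None and di in sj:
        found.append("RAW")          # j reads what i writes
    if dj is not None and dj in si:
        found.append("WAR")          # j writes what i reads
    if di is not None and di == dj:
        found.append("WAW")          # both write the same register
    return found

# add r1,r2,r3 followed by sub r4,r1,r3 -> RAW on r1
print(hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))   # ['RAW']
```

In a simple in-order five-stage pipeline only RAW hazards actually cause trouble; WAR and WAW matter once instructions can write out of order.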
Data Hazards
Simple Solution to RAW
Data Hazards
[Figure 3.9: pipeline diagram (IF, ID/RF, EX, MEM, WB) for
add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11]
The use of the result of the ADD instruction in the next three
instructions causes a hazard, since the register is not written
until after those instructions read it.
Data Hazards
Forwarding To Avoid Data Hazard
Forwarding is the concept of making data available to the input of
the ALU for subsequent instructions, even though the generating
instruction hasn’t reached WB to write the memory or registers.
[Figure 3.10: the same instruction sequence, with the ALU output of
add r1,r2,r3 forwarded to the ALU inputs of the following instructions]
Data Hazards
The data isn’t loaded until after the MEM stage, so there are some
instances where hazards occur even with forwarding.
[Figure 3.12: lw r1, 0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9 —
the sub needs r1 before the load’s MEM stage has produced it]
Data Hazards
The stall is necessary, as shown here.
[Figure 3.13: the same sequence with a one-cycle bubble inserted so
that sub r4,r1,r6 receives r1 after the load’s MEM stage]
Data Hazards: Pipeline Scheduling
The compiler can reorder (schedule) instructions to avoid load-use
stalls. The scheduled code for a = b + c; d = e − f:
lw Rb, b
lw Rc, c
lw Re, e
add Ra, Rb, Rc
lw Rf, f
sw a, Ra
sub Rd, Re, Rf
sw d, Rd
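Under the usual assumption of full forwarding with a one-cycle load-use stall, the effect of scheduling can be counted with a sketch (Python; the instruction encoding and helper name are ours, not from the slides):

```python
def load_use_stalls(program):
    """Count one-cycle load-use stalls, assuming full forwarding:
    a stall occurs only when an instruction uses a register loaded
    by the immediately preceding lw.
    Each instruction is (op, dest, sources)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

unscheduled = [   # a = b + c; d = e - f, compiled in source order
    ("lw", "Rb", []), ("lw", "Rc", []), ("add", "Ra", ["Rb", "Rc"]),
    ("sw", None, ["Ra"]), ("lw", "Re", []), ("lw", "Rf", []),
    ("sub", "Rd", ["Re", "Rf"]), ("sw", None, ["Rd"]),
]
scheduled = [     # the slide's reordered version
    ("lw", "Rb", []), ("lw", "Rc", []), ("lw", "Re", []),
    ("add", "Ra", ["Rb", "Rc"]), ("lw", "Rf", []),
    ("sw", None, ["Ra"]), ("sub", "Rd", ["Re", "Rf"]), ("sw", None, ["Rd"]),
]
print(load_use_stalls(unscheduled), load_use_stalls(scheduled))   # 2 0
```

The unscheduled code stalls twice (the add needs Rc, the sub needs Rf, each right after its load); the scheduled version separates every load from its first use and stalls zero times.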
Data Hazards: Pipeline Scheduling
[Chart: percentage of loads that stall, unscheduled vs. scheduled code —
gcc: 54% unscheduled, 31% scheduled
spice: 42% unscheduled, 14% scheduled
tex: 65% unscheduled, 25% scheduled]
Control Hazards
A.1 What is Pipelining?
A.2 The Major Hurdle of Pipelining – Structural Hazards
– Data Hazards
– Control Hazards
A.3 How is Pipelining Implemented?
A.4 What Makes Pipelining Hard to Implement?
A.5 Extending the MIPS Pipeline to Handle Multi-cycle Operations

A control hazard occurs when we need to find the destination of a
branch and can’t fetch any new instructions until we know that
destination.
Control Hazards: Control Hazard on Branches — Three Stage Stall
[Figure: 10: beq r1,r3,36; 14: and r2,r3,r5; 18: or r6,r1,r7;
22: add r8,r1,r9; 36: xor r10,r1,r11 — the three instructions after
the branch are fetched before the branch resolves]
Control Hazards: Branch Stall Impact
• If CPI = 1, 30% of instructions are branches, and each stalls 3
cycles => new CPI = 1 + 0.3 × 3 = 1.9!
• Two-part solution to this dramatic increase:
– Determine whether the branch is taken or not sooner, AND
– Compute the taken-branch address earlier
• MIPS solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
• must be fast
• can’t afford to subtract
• compares with 0 are simple
• greater-than and less-than test the sign bit, but not-equal must OR all bits
• more general compares need the ALU
– 1 clock cycle penalty for branch versus 3
In the next chapter, we’ll look at ways to avoid the branch altogether.
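The arithmetic behind that CPI jump, and the payoff of the MIPS fix, in a two-line sketch (Python; variable names are ours):

```python
# A stall of s cycles on a fraction f of instructions adds f * s
# cycles to the base CPI.
base_cpi, branch_freq = 1.0, 0.30
print(round(base_cpi + branch_freq * 3, 2))   # 1.9  (3-cycle branch stall)
print(round(base_cpi + branch_freq * 1, 2))   # 1.3  (MIPS 1-cycle penalty)
```

So moving branch resolution into ID/RF cuts the branch contribution to CPI from 0.9 cycles per instruction down to 0.3.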
Control Hazards: Five Branch Hazard Alternatives
#1: Stall until the branch direction is clear
Control Hazards: Five Branch Hazard Alternatives
#4: Delayed Branch — the instructions in the branch-delay slots
always execute, whether or not the branch is taken:
branch instruction
sequential successor1
sequential successor2
........
sequential successorn    <- branch delay of length n
branch target if taken
Control Hazards: Delayed Branch
• Where do we get instructions to fill the branch-delay slot?
– From before the branch instruction
– From the target address: only valuable when the branch is taken
– From the fall-through path: only valuable when the branch is not taken
– Cancelling branches allow more slots to be filled
Control Hazards: Evaluating Branch Alternatives
Control Hazards: Pipelining Introduction Summary
Control Hazards: Compiler “Static” Prediction of Taken/Untaken Branches
The compiler can predict which direction it thinks a branch will go.
Here are the results when it does so.
[Chart: frequency of misprediction (left axis, up to 70%) and
misprediction rate (right axis, up to 14%) across the benchmarks
alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora,
swm256, tomcatv]
• Two strategies
– Backward branches predict taken, forward branches predict not taken
– Profile-based prediction: record branch behavior and predict the
branch based on a prior run
Control Hazards: Evaluating Static Branch Prediction Strategies
• The misprediction rate alone ignores the frequency of branches
[Chart, log scale: instructions between mispredicted branches,
profile-based vs. direction-based, across the same benchmarks]
An example execution sequence
Here’s a sample sequence of instructions to execute.
We’ll make some assumptions, just so we can show actual data values.
— Each register contains its number plus 100. For instance, register $8
contains 108, register $29 contains 129, and so forth.
— Every data memory location contains 99.
Our pipeline diagrams will follow some conventions.
— An X indicates values that aren’t important, like the constant field
of an R-type instruction.
— Question marks (???) indicate values we don’t know, usually resulting
from instructions coming before and after the ones in our example.
Cycle 1 (filling)
IF: lw $8, 4($29)   ID: ???   EX: ???   MEM: ???   WB: ???
[Datapath diagram: the lw is fetched and the PC advances to 1004;
all later pipeline registers still hold unknown (???) values]
Cycle 2
IF: sub $2, $4, $5   ID: lw $8, 4($29)   EX: ???   MEM: ???   WB: ???
[Datapath diagram: the lw reads register $29 (129) and sign-extends
the offset 4; the sub is fetched and the PC advances to 1008]
Cycle 3
IF: and $9, $10, $11   ID: sub $2, $4, $5   EX: lw $8, 4($29)   MEM: ???   WB: ???
[Datapath diagram: the lw’s ALU adds 129 + 4 = 133 (ALUSrc = 1,
ALUOp = add); the sub reads registers $4 (104) and $5 (105);
the PC advances to 1012]
Cycle 4
IF: or $16, $17, $18   ID: and $9, $10, $11   EX: sub $2, $4, $5   MEM: lw $8, 4($29)   WB: ???
[Datapath diagram: the lw reads memory address 133 and gets 99
(MemRead = 1); the sub’s ALU computes 104 − 105 = −1; the and reads
$10 (110) and $11 (111); the PC advances to 1016]
Cycle 5 (full)
IF: add $13, $14, $0   ID: or $16, $17, $18   EX: and $9, $10, $11   MEM: sub $2, $4, $5   WB: lw $8, 4($29)
[Datapath diagram: the lw writes 99 into register $8 (RegWrite = 1,
MemToReg = 1); the sub’s result −1 passes through MEM; the and’s ALU
computes 110 & 111 = 110; the or reads $17 (117) and $18 (118);
the PC advances to 1020]
Cycle 6 (emptying)
IF: ???   ID: add $13, $14, $0   EX: or $16, $17, $18   MEM: and $9, $10, $11   WB: sub $2, $4, $5
[Datapath diagram: the sub writes −1 into register $2; the or’s ALU
computes 117 | 118 = 119; the and’s result 110 passes through MEM;
the add reads $14 (114) and $0 (0)]
Cycle 7
IF: ???   ID: ???   EX: add $13, $14, $0   MEM: or $16, $17, $18   WB: and $9, $10, $11
[Datapath diagram: the and writes 110 into register $9; the add’s
ALU computes 114 + 0 = 114; the or’s result 119 passes through MEM]
Cycle 8
IF: ???   ID: ???   EX: ???   MEM: add $13, $14, $0   WB: or $16, $17, $18
[Datapath diagram: the or writes 119 into register $16; the add’s
result 114 passes through MEM]
Cycle 9
IF: ???   ID: ???   EX: ???   MEM: ???   WB: add $13, $14, $0
[Datapath diagram: the add writes 114 into register $13, and the
pipeline is empty again]
That’s a lot of diagrams there
                        Clock cycle
                        1    2    3    4    5    6    7    8    9
lw  $t0, 4($sp)         IF   ID   EX   MEM  WB
sub $v0, $a0, $a1            IF   ID   EX   MEM  WB
and $t1, $t2, $t3                 IF   ID   EX   MEM  WB
or  $s0, $s1, $s2                      IF   ID   EX   MEM  WB
add $t5, $t6, $0                            IF   ID   EX   MEM  WB

Compare the last nine slides with the pipeline diagram above.
— You can see how instruction executions are overlapped.
— Each functional unit is used by a different instruction in each cycle.
— The pipeline registers save control and data values generated in
previous clock cycles for later use.
— When the pipeline is full in clock cycle 5, all of the hardware units
are utilized. This is the ideal situation, and it is what makes
pipelined processors so fast.
Try to understand this example, or the similar one in the book at the
end of Section 6.3.
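A table like the one above can be generated mechanically, since in an ideal stall-free pipeline instruction i occupies stage s during cycle i + s + 1. A sketch (Python; the function name is ours, not from the slides):

```python
def pipeline_diagram(instructions, stages=("IF", "ID", "EX", "MEM", "WB")):
    """Stage-per-cycle rows for an ideal, stall-free pipeline:
    instruction i (0-based) is in stage s (0-based) during cycle i + s + 1,
    so row i has its stages written starting at column i."""
    cycles = len(instructions) + len(stages) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = [""] * cycles
        for s, stage in enumerate(stages):
            row[i + s] = stage
        rows.append((name, row))
    return rows

for name, row in pipeline_diagram(["lw $t0, 4($sp)", "sub $v0, $a0, $a1"]):
    print(f"{name:18}" + " ".join(f"{c:>3}" for c in row))
```

With five instructions and five stages this reproduces the 9-cycle table above: 5 + 5 − 1 = 9 cycles in total, the fill/drain cost showing up as the empty corners.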
Performance Revisited
[Figure: datapath timing with ALU, memory read, and memory write]