Lec 1
Fall, 2017
These slides are adapted from notes by Dr. David Patterson (UCB)
1
Single-Cycle vs. Pipelined Execution
Non-Pipelined
[Figure: program execution order (in instructions) against time, 0 to 1800 ps,
for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes
through Instruction Fetch, REG RD, ALU, MEM, and REG WR. Successive
instructions start 800 ps apart.]
Pipelined
[Figure: program execution order (in instructions) against time, 0 to 1600 ps,
for the same three lw instructions. Each instruction passes through
Instruction Fetch, REG RD, ALU, MEM, and REG WR in 200 ps steps. Successive
instructions start 200 ps apart.]
2
Speedup
• Consider the unpipelined processor introduced previously. Assume that it has
a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches
and 5 cycles for memory operations. Assume that the relative frequencies of
these operations are 40%, 20%, and 40%, respectively. Suppose that, due to
clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the
clock. Ignoring any latency impact, how much speedup in the instruction
execution rate will we gain from a pipeline?
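The question above can be worked through numerically. The sketch below uses only the figures given on the slide and assumes, as is usual for this kind of exercise, that the pipelined processor completes one instruction per clock cycle (CPI of 1). The function and constant names are illustrative only.

```python
# Operation mix from the slide: (fraction of instructions, cycles when unpipelined).
MIX = {
    "alu": (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}
CLOCK_NS = 1.0      # unpipelined clock cycle
OVERHEAD_NS = 0.2   # clock skew and setup overhead added by pipelining


def unpipelined_time_ns(mix=MIX, clock_ns=CLOCK_NS):
    """Average instruction time = clock cycle x average CPI."""
    average_cpi = sum(freq * cycles for freq, cycles in mix.values())
    return average_cpi * clock_ns


def pipelined_time_ns(clock_ns=CLOCK_NS, overhead_ns=OVERHEAD_NS):
    """Assumption: ideal pipelined CPI of 1, so time per instruction is one clock."""
    return clock_ns + overhead_ns


def pipeline_speedup():
    return unpipelined_time_ns() / pipelined_time_ns()


if __name__ == "__main__":
    print(f"unpipelined: {unpipelined_time_ns():.2f} ns per instruction")
    print(f"pipelined:   {pipelined_time_ns():.2f} ns per instruction")
    print(f"speedup:     {pipeline_speedup():.2f}")
```

Under these assumptions the unpipelined average is 4.4 ns per instruction, the pipelined clock is 1.2 ns, and the speedup is about 3.7.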
3
Comments about Pipelining
4
Pipeline Hazards
• Limits to pipelining: Hazards prevent next
instruction from executing during its
designated clock cycle
– Structural hazards: two different instructions use
the same hardware in the same cycle
– Data hazards: Instruction depends on result of
prior instruction still in the pipeline
– Control hazards: Pipelining of branches & other
instructions that change the PC
5
Structural Hazards
• Attempt to use same resource twice at same time
• Example: Single Memory for instructions, data
– Accessed by IF stage
– Accessed at same time by MEM stage
• Solutions?
– Delay second access by one clock cycle
– Provide separate memories for instructions, data
• This is what the book does
• This is called a “Harvard Architecture”
• Real pipelined processors have separate caches
6
Pipelined Example -
Executing Multiple Instructions
• Consider the following instruction
sequence:
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
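How the four instructions above move through the pipeline can be traced with a short sketch. It assumes an ideal five-stage pipeline with no stalls; the stage names IF, ID, EX, MEM, WB and the function name `pipeline_trace` are illustrative choices, not taken from the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed five-stage pipeline
PROGRAM = ["LW", "SW", "ADD", "SUB"]      # the sequence from the slide


def pipeline_trace(instructions, stages=STAGES):
    """Return {cycle: {stage: instruction}} for an ideal pipeline (no stalls)."""
    total_cycles = len(instructions) + len(stages) - 1
    trace = {}
    for cycle in range(1, total_cycles + 1):
        occupancy = {}
        for s, stage in enumerate(stages):
            # Instruction i (0-based) occupies stage s during cycle i + s + 1.
            i = cycle - 1 - s
            if 0 <= i < len(instructions):
                occupancy[stage] = instructions[i]
        trace[cycle] = occupancy
    return trace


if __name__ == "__main__":
    print("CC  " + "  ".join(f"{st:<4}" for st in STAGES))
    for cycle, occupancy in pipeline_trace(PROGRAM).items():
        row = "  ".join(f"{occupancy.get(st, '-'):<4}" for st in STAGES)
        print(f"{cycle:<3} {row}")
```

Four instructions in a five-stage pipeline finish in 4 + 5 - 1 = 8 clock cycles, and the pipeline is only full of useful work around cycles 4 and 5.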
7
Executing Multiple Instructions
Clock Cycle   IF    ID    EX    MEM   WB
1             LW
2             SW    LW
3             ADD   SW    LW
4             SUB   ADD   SW    LW
5                   SUB   ADD   SW    LW
6                         SUB   ADD   SW
7                               SUB   ADD
8                                     SUB
15
Alternative View - Multicycle Diagram
[Figure: multicycle pipeline diagram of the four instructions across clock
cycles CC 1 to CC 8]
16
Alternative View - Multicycle Diagram
[Figure: the same multicycle diagram across CC 1 to CC 8, with a memory
conflict marked]
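Where a single shared memory would be accessed twice in one cycle can be found mechanically. The sketch below assumes an ideal five-stage pipeline in which instruction i (0-based) is fetched in cycle i + 1 and reaches MEM in cycle i + 4, and that only loads and stores touch memory in MEM. The function name and the program encoding are hypothetical.

```python
# (mnemonic, accesses data memory in MEM?) for the sequence lw, sw, add, sub
PROGRAM = [("LW", True), ("SW", True), ("ADD", False), ("SUB", False)]


def memory_conflicts(program):
    """List (cycle, instruction in MEM, instruction in IF) for a one-port memory.

    Assumes no stalls: instruction i is in IF during cycle i + 1 and in MEM
    during cycle i + 4 (cycles are 1-based).
    """
    conflicts = []
    for i, (mem_instr, uses_memory) in enumerate(program):
        if not uses_memory:
            continue
        cycle = i + 4            # cycle in which this instruction is in MEM
        fetch_index = cycle - 1  # instruction being fetched in that same cycle
        if fetch_index < len(program):
            conflicts.append((cycle, mem_instr, program[fetch_index][0]))
    return conflicts


if __name__ == "__main__":
    for cycle, mem_instr, fetched in memory_conflicts(PROGRAM):
        print(f"CC {cycle}: {mem_instr} in MEM conflicts with fetch of {fetched}")
```

For this sequence the only conflict is in CC 4, where LW accesses data memory while SUB is being fetched. SW reaches MEM in CC 5, when no further instruction is fetched.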
17
One Memory Port Structural Hazards
Time (clock cycles); rows in instruction order
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Load     Ifetch   Reg      ALU      DMem     Reg
Instr 1           Ifetch   Reg      ALU      DMem     Reg
Instr 2                    Ifetch   Reg      ALU      DMem     Reg
Stall                               Bubble   Bubble   Bubble   Bubble   Bubble
Instr 3                                      Ifetch   Reg      ALU      DMem     Reg
18
Structural Hazards
Some Common Structural Hazards:
• Memory:
– we’ve already mentioned this one.
• Floating point:
– Since many floating point instructions require many
cycles, it’s easy for them to interfere with each other.
• Starting up more of one type of instruction than
there are resources.
– For instance, the PA-8600 can support two ALU + two
load/store instructions per cycle - that’s how much
hardware it has available.
19
Dealing with Structural Hazards
Stall
– low cost, simple
– Increases CPI
– use for rare cases, since stalling hurts performance
Pipeline hardware resource
– useful for multi-cycle resources
– good performance
– sometimes complex, e.g., RAM
Replicate resource
– good performance
– increases cost (+ maybe interconnect delay)
– useful for cheap or divisible resources
20
Structural Hazards
• Structural hazards are reduced with these rules:
– Each instruction uses a resource at most once
– Always use the resource in the same pipeline stage
– Use the resource for one cycle only
• Many RISC ISAs are designed with this in mind
• Sometimes very complex to do this.
– For example, memory of necessity is used in the IF and
MEM stages.
21
Structural Hazards
We want to compare the performance of two machines.
Which machine is faster?
– Machine A: Dual ported memory - so there are no memory
stalls
– Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
Assume:
– Ideal CPI = 1 for both
– Loads are 40% of instructions executed
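The comparison above can be worked out by computing the average instruction time (CPI times clock cycle) for each machine. The sketch below assumes that on Machine B every load costs exactly one stall cycle because of the single memory port; the slide does not state the stall penalty, so that value is an assumption, as are the names used.

```python
LOAD_FRACTION = 0.40         # from the slide
STALL_CYCLES_PER_LOAD = 1    # assumption: one stall per load on Machine B
CLOCK_A = 1.0                # Machine A clock cycle (arbitrary time unit)
CLOCK_B = CLOCK_A / 1.05     # Machine B clock rate is 1.05 times faster


def avg_instruction_time(ideal_cpi, stall_cycles_per_instr, clock_cycle):
    """Average instruction time = (ideal CPI + stall cycles per instruction) x clock cycle."""
    return (ideal_cpi + stall_cycles_per_instr) * clock_cycle


TIME_A = avg_instruction_time(1.0, 0.0, CLOCK_A)
TIME_B = avg_instruction_time(1.0, LOAD_FRACTION * STALL_CYCLES_PER_LOAD, CLOCK_B)

if __name__ == "__main__":
    print(f"Machine A: {TIME_A:.3f} time units per instruction")
    print(f"Machine B: {TIME_B:.3f} time units per instruction")
    print(f"Machine A is {TIME_B / TIME_A:.2f} times faster")
```

Under this assumption Machine B has a CPI of 1.4, so its average instruction time is 1.4 / 1.05, about 1.33 times that of Machine A. The faster clock does not make up for the load stalls.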
22
Speedup from Pipelining
\[
\text{Speedup from pipelining}
= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}
= \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}
\]
24
Summary - Structural Hazards
• Speedup <= Pipeline depth; if ideal CPI is 1, then:
\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}}
\times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\]
26
Data Hazards
• Data hazards occur when data is used
before it is stored
Time (in clock cycles)
                        CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/-20  -20   -20   -20   -20
Program execution order (in instructions)
sub $2, $1, $3          IM    Reg   ALU   DM    Reg
The use of the result of the SUB instruction in the next three instructions causes a
data hazard, since the register is not written until after those instructions read it.
27
Data Hazards
Execution Order is: Read After Write (RAW)
InstrI
InstrJ
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
28
Data Hazards
Execution Order is: Write After Read (WAR)
InstrI
InstrJ
InstrJ tries to write operand before InstrI reads it
– Gets wrong operand
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Execution Order is: Write After Write (WAW)
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
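The RAW, WAR, and WAW cases above can be told apart by comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below is a minimal illustration that assumes three-operand instructions written as "op dest,src1,src2", as in the slide examples; the names `Instr`, `parse`, and `dependences` are hypothetical. It reports dependences only. Whether a dependence becomes a hazard depends on the pipeline.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def parse(text: str) -> Instr:
    """Parse 'op dest,src1,src2' (the format used in the slide examples)."""
    op, operands = text.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(op, regs[0], tuple(regs[1:]))


def dependences(first: Instr, second: Instr) -> List[str]:
    """Classify the dependences of a later instruction on an earlier one."""
    kinds = []
    if first.dest in second.srcs:
        kinds.append("RAW")  # second reads what first writes
    if second.dest in first.srcs:
        kinds.append("WAR")  # second writes what first reads
    if first.dest == second.dest:
        kinds.append("WAW")  # both write the same register
    return kinds


if __name__ == "__main__":
    pairs = [
        ("add r1,r2,r3", "sub r4,r1,r3"),
        ("sub r4,r1,r3", "add r1,r2,r3"),
        ("sub r1,r4,r3", "add r1,r2,r3"),
    ]
    for i_text, j_text in pairs:
        print(f"I: {i_text:<14} J: {j_text:<14} -> {dependences(parse(i_text), parse(j_text))}")
```

Run on the three I/J pairs from the slides, this classifies them as RAW, WAR, and WAW respectively.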