Chapter Six
Pipelining
Improve performance by increasing instruction throughput
[Figure: Nonpipelined vs. pipelined execution of a sequence of loads — each instruction passes through Instruction fetch, Reg, ALU, Data access, and Reg. Nonpipelined, each instruction takes 800 ps; pipelined, a new instruction starts every 200 ps.]
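A quick worked example using the times in the figure (an 800 ps single-cycle instruction and 200 ps pipeline stages): three instructions take 3 × 800 ps = 2400 ps without pipelining, but only 5 × 200 ps + 2 × 200 ps = 1400 ps pipelined. For long instruction sequences the speedup approaches the ratio of instruction time to stage time, 800/200 = 4×.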
Pipelining
What makes it easy?
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores
What makes it hard?
  structural hazards: suppose we had only one memory (an instruction fetch and a load's data access would then need it in the same cycle)
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues
We'll talk about modern processors and what really makes it hard:
  exception handling
  trying to improve performance with out-of-order execution, etc.
Basic Idea
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: Single-cycle datapath divided into the five stages — PC, instruction memory, register file, sign extend, ALU, and data memory.]
Pipelined Datapath
[Figure: Pipelined datapath — pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separate the instruction fetch, decode, execute, memory, and write-back hardware.]
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Corrected Datapath
[Figure: Corrected pipelined datapath — the write-register number is now carried forward through the ID/EX, EX/MEM, and MEM/WB pipeline registers, so it arrives at the register file's Write register input in the same cycle as the data being written back.]
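To make the fix concrete, here is a minimal C sketch of the four pipeline registers (field names are illustrative assumptions, and many signals are omitted). The point of the correction is that the destination-register number rides along through ID/EX, EX/MEM, and MEM/WB so it reaches the register file together with the data to be written:

#include <stdint.h>

/* One struct per pipeline register; fields are illustrative, not exhaustive. */
struct IF_ID  { uint32_t pc_plus4, instruction; };
struct ID_EX  { uint32_t pc_plus4, read_data1, read_data2, sign_ext_imm;
                uint8_t  rt, rd; };              /* candidate destination fields        */
struct EX_MEM { uint32_t branch_target, alu_result, write_data;
                uint8_t  zero, write_reg; };     /* destination number carried forward  */
struct MEM_WB { uint32_t mem_read_data, alu_result;
                uint8_t  write_reg; };           /* arrives at write-back with the data */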
[Figure: Multiple-clock-cycle pipeline diagram for three loads (the second and third are lw $2, 200($0) and lw $3, 300($0)) — each instruction occupies the IM, Reg, ALU, DM, and Reg stages in successive cycles, starting one cycle after the previous instruction.]
Can help with answering questions like:
  how many cycles does it take to execute this code?
  what is the ALU doing during cycle 4?
Use this representation to help understand datapaths
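As a worked answer to the first two questions (for this simple 5-stage pipeline with one instruction issued per cycle and no stalls): n instructions finish in 5 + (n − 1) cycles, so the three loads above take 7 cycles, and in cycle 4 the ALU is calculating the address for the second lw. A minimal C sketch (stage names only; hazards ignored) that prints this kind of multiple-clock-cycle diagram:

#include <stdio.h>

int main(void) {
    const char *stage[] = { "IM", "Reg", "ALU", "DM", "Reg" };
    const int n_instr = 3, n_stages = 5;
    const int total_cycles = n_stages + n_instr - 1;     /* 5 + (3 - 1) = 7 */

    for (int i = 0; i < n_instr; i++) {                  /* one row per instruction */
        printf("instr %d:", i + 1);
        for (int c = 1; c <= total_cycles; c++) {
            int s = c - 1 - i;                           /* stage occupied in cycle c */
            printf(" %4s", (s >= 0 && s < n_stages) ? stage[s] : ".");
        }
        printf("\n");
    }
    printf("total cycles: %d\n", total_cycles);
    return 0;
}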
Pipeline Control
[Figure: Pipelined datapath with the control signals attached — PCSrc, RegWrite, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, Branch, and RegDst, plus the ALU control fed by instruction bits [15-0] and the destination field in bits [20-16].]
We have 5 stages. What needs to be controlled in each stage?
  Instruction fetch and PC increment
  Instruction decode / register fetch
  Execution
  Memory stage
  Write back
How would control be handled in an automobile plant?
  a fancy control center telling everyone what to do?
  should we use a finite state machine?
Pipeline Control
Pass control signals along just like the data
              Execution/address calculation   Memory access stage      Write-back stage
              stage control lines             control lines            control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead MemWrite RegWrite  MemtoReg
R-format        1       1       0       0       0       0       0        1         0
lw              0       0       0       1       0       1       0        1         1
sw              X       0       0       1       0       0       1        0         X
beq             X       0       1       0       1       0       0        0         X
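A minimal C sketch of how a main control unit could generate these nine signals from the opcode (the table's values are encoded directly; the struct and field names simply reuse the signal names, and "don't care" entries default to 0):

#include <stdint.h>

struct Ctrl {
    /* EX stage  */ uint8_t RegDst, ALUOp1, ALUOp0, ALUSrc;
    /* MEM stage */ uint8_t Branch, MemRead, MemWrite;
    /* WB stage  */ uint8_t RegWrite, MemtoReg;
};

struct Ctrl decode(uint8_t opcode) {
    struct Ctrl c = {0};                        /* X (don't care) entries stay 0 */
    switch (opcode) {
    case 0x00: c.RegDst = 1; c.ALUOp1 = 1; c.RegWrite = 1;                  break; /* R-format */
    case 0x23: c.ALUSrc = 1; c.MemRead = 1; c.RegWrite = 1; c.MemtoReg = 1; break; /* lw */
    case 0x2B: c.ALUSrc = 1; c.MemWrite = 1;                                break; /* sw */
    case 0x04: c.ALUOp0 = 1; c.Branch = 1;                                  break; /* beq */
    }
    return c;
}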
[Figure: The control values for the EX, MEM, and WB stages are generated during decode and placed in the ID/EX pipeline register; they then travel with the instruction, the MEM and WB groups moving into EX/MEM and the WB group into MEM/WB.]
[Figure: Pipelined datapath with the control unit attached — the extended pipeline registers carry the control lines (Branch, MemRead, MemWrite, RegDst, RegWrite, etc.) to the stage that uses them.]
Dependencies
Problem with starting the next instruction before the first is finished: dependencies that go backward in time are data hazards.
[Figure: Pipeline diagram for sub $2, $1, $3 followed by and $12, $2, $5, or $13, $6, $2, add $14, $2, $2, and sw $15, 100($2). Register $2 is not written until clock cycle 5 (its value changes from 10 to 20), but the and and or read $2 in cycles 3 and 4 and would get the old value — dependences that go backward in time.]
Software Solution
Have compiler guarantee no hazards. Where do we insert the nops?

  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
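One possible answer (assuming, as is usual for this pipeline, that the register file writes in the first half of a clock cycle and reads in the second half, and that there is no forwarding): two nops inserted immediately after the sub are enough, since the and then does not read $2 until the cycle in which sub writes it back, and every later use of $2 comes later still.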
Forwarding
Use temporary results, don't wait for them to be written
  register file forwarding to handle read/write to same register
  ALU forwarding
[Figure: Pipeline diagram tracking the values of register $2, EX/MEM, and MEM/WB each clock cycle for the sub / and / or / add / sw sequence. The sub result (20) sits in EX/MEM during cycle 4 and in MEM/WB during cycle 5, so it can be forwarded to the ALU for the and and or instructions instead of waiting for $2 to be written at the end of cycle 5.]
Forwarding
The main idea (some details not shown)
[Figure: Forwarding hardware — a forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the source registers (Rs, Rt) of the instruction in the EX stage and drives the multiplexors on the ALU inputs (ForwardB and its counterpart for the other operand).]
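The forwarding decision itself is just a pair of register-number comparisons. A minimal C sketch of the selection logic for the upper ALU input (ForwardA; ForwardB is the same test with ID/EX.RegisterRt), with the EX/MEM result checked first because it is the more recent one:

/* Returns which value the ALU's first operand mux should select:
 * 0 = register file, 2 = forward from EX/MEM, 1 = forward from MEM/WB. */
int forward_a(int ex_mem_regwrite, int ex_mem_rd,
              int mem_wb_regwrite, int mem_wb_rd,
              int id_ex_rs) {
    if (ex_mem_regwrite && ex_mem_rd != 0 && ex_mem_rd == id_ex_rs)
        return 2;                    /* result just computed, still in EX/MEM    */
    if (mem_wb_regwrite && mem_wb_rd != 0 && mem_wb_rd == id_ex_rs)
        return 1;                    /* older result (or load data) in MEM/WB    */
    return 0;                        /* no hazard: use the register file value   */
}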
[Figure: A load-use hazard — when an instruction needs a register loaded by the immediately preceding lw (here in the sequence leading to or $8, $2, $6), the dependence still goes backward in time, and forwarding alone cannot resolve it.]
Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: Stalling — lw $2, 20($1) is followed by a dependent instruction; a bubble is inserted (the stalled slot becomes a nop) and the dependent instruction repeats its decode stage one cycle later, after which forwarding can supply $2 (e.g., to the following or $8, $2, $6).]
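The load-use case has to be caught one stage earlier, in decode. A minimal C sketch of the hazard detection test, assuming the usual stall mechanism (hold the PC and the IF/ID register, and zero the control signals entering ID/EX so a bubble flows down the pipeline):

/* Stall when the instruction in EX is a load whose destination ($rt)
 * matches either source register of the instruction being decoded.   */
int load_use_stall(int id_ex_memread, int id_ex_rt,
                   int if_id_rs, int if_id_rt) {
    return id_ex_memread &&
           (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
}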
Branch Hazards
When we decide to branch, other instructions are in the pipeline!
[Figure: Branch hazard — the branch at address 40 (beq $1, $3, 28) is not resolved until late in the pipeline, so the sequential instructions at 44, 48 (or $13, $6, $2), and 52 have already been fetched before the target at 72 (lw $4, 50($7)) can be fetched.]
We are predicting "branch not taken"
  need to add hardware for flushing instructions if we are wrong
Flushing Instructions
[Figure: Datapath with branch hazard handling — a hazard detection unit, an IF.Flush signal that zeroes the instruction in IF/ID when the branch is taken, and the forwarding unit.]
Branches
If the branch is taken, we have a penalty of one cycle
For our simple design, this is reasonable
With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
Solution: dynamic branch prediction
[Figure: 2-bit branch prediction state machine — two "predict taken" states and two "predict not taken" states; the prediction changes only after it has been wrong twice in a row.]
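A minimal C sketch of one entry of such a predictor, implemented as a 2-bit saturating counter (counter values 2 and 3 predict taken; the prediction only flips after two consecutive mispredictions). A real processor keeps a table of these counters indexed by branch-address bits:

#include <stdbool.h>

static unsigned counter = 2;                 /* 0,1 = predict not taken; 2,3 = predict taken */

bool predict_taken(void) { return counter >= 2; }

void update(bool actually_taken) {           /* saturating increment / decrement */
    if (actually_taken) { if (counter < 3) counter++; }
    else                { if (counter > 0) counter--; }
}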
Branch Prediction
Sophisticated Techniques:
Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
A branch delay slot, which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
Modern processors predict correctly 95% of the time!
Improving Performance
Try and avoid stalls! E.g., reorder these instructions (one possible reordering is given below):

  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
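One possible reordering (both loads first, then both stores, which is safe here because the two addresses are distinct and both loads complete before either store): lw $t0, 0($t1); lw $t2, 4($t1); sw $t0, 4($t1); sw $t2, 0($t1). Now no store immediately consumes the result of the preceding load, so the load-use stall disappears.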
Dynamic Pipeline Scheduling
  Hardware chooses which instructions to execute next
  Will execute instructions out of order (e.g., it doesn't wait for a dependency to be resolved, but rather keeps going!)
  Speculates on branches and keeps the pipeline full (may need to roll back if a prediction is incorrect)
Advanced Pipelining
Increase the depth of the pipeline
Start more than one instruction each cycle (multiple issue)
Loop unrolling to expose more ILP (better scheduling)
Superscalar processors
  DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
  All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different pipes)
VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
This class has given you the background you need to learn more!
Chapter 6 Summary
Pipelining does not improve latency, but does improve throughput
[Figure: Summary chart comparing single-cycle (Section 5.4), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10) designs, arranged from slower to faster and from one to several instructions completed per cycle.]