Lecture13 Pipeline2
Review of the basic pipeline architecture
DADD R1 R2 R3
DSUB R4 R1 R5
AND R6 R1 R7
OR R8 R1 R9
BEQ R1 R4 offset
The challenges of data sharing (Hazards)
Data Hazards
RAW
WAR
RAR
WAW
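The four dependency types can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch, assuming the first operand is the destination and the remaining operands are sources (as in DADD R1 R2 R3); `parse` and `classify` are hypothetical helper names, not part of the lecture.

```python
def parse(instr):
    """Split 'DADD R1 R2 R3' into (dest, sources); assumes dest comes first."""
    _, dest, *srcs = instr.split()
    return dest, set(srcs)

def classify(first, second):
    """Return the dependency types from `first` to a later instruction `second`."""
    d1, s1 = parse(first)
    d2, s2 = parse(second)
    kinds = set()
    if d1 in s2:
        kinds.add("RAW")  # second reads what first writes
    if d2 in s1:
        kinds.add("WAR")  # second writes what first reads
    if d1 == d2:
        kinds.add("WAW")  # both write the same register
    if s1 & s2:
        kinds.add("RAR")  # both read the same register (never a hazard)
    return kinds

print(classify("DADD R1 R2 R3", "DSUB R4 R1 R5"))  # {'RAW'}
```

Only RAW, WAR and WAW can force the pipeline to stall or reorder; RAR is listed for completeness.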
Data Dependency: Solutions
Data forwarding / bypassing / short-circuiting
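To see what forwarding buys, the sketch below counts stall cycles for a short sequence of ALU instructions with and without it. It rests on assumptions that are not stated on the slides: a five-stage pipeline (IF, ID, EX, MEM, WB), a register file written in the first half of WB and read in the second half of ID, and ALU-to-ALU forwarding that removes every RAW stall. `total_stalls` is a hypothetical helper.

```python
def total_stalls(program, forwarding):
    """Count stall cycles for in-order ALU instructions (see assumptions above)."""
    id_cycle = {}      # instruction index -> cycle in which it occupies ID
    last_writer = {}   # register -> index of the latest instruction writing it
    stalls = 0
    for i, instr in enumerate(program):
        _, dest, *srcs = instr.split()
        earliest = id_cycle[i - 1] + 1 if i > 0 else 2
        ready = earliest
        if not forwarding:
            for reg in srcs:
                if reg in last_writer:
                    # The producer reaches WB three cycles after its ID.
                    ready = max(ready, id_cycle[last_writer[reg]] + 3)
        stalls += ready - earliest
        id_cycle[i] = ready
        last_writer[dest] = i
    return stalls

program = ["DADD R1 R2 R3", "DSUB R4 R1 R5", "AND R6 R1 R7", "OR R8 R1 R9"]
print(total_stalls(program, forwarding=False))  # 2
print(total_stalls(program, forwarding=True))   # 0
```

Loads and branches need extra rules (a load result is only available after MEM), which this sketch deliberately leaves out.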
DADD R1 R2 R3
DSUB R4 R1 R5
AND R6 R1 R7
OR R8 R1 R9
BEQ R1 R4 offset
XOR R1 R4 R11
PC = 00: DADD R1 R2 R3
PC = 04: DSUB R4 R1 R5
PC = 08: AND R6 R1 R7
PC = 16: OR R8 R1 R9
PC = 24 + offset: …….
Branch Prediction: Control Dependency
Control hazards (they break the normal pipeline flow)
Predict the control path (branch prediction)
- Branching decision (taken or not taken)
- Branch target address (effective address)
Both have been moved to ID; we simply don't want to wait!
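The cost of control hazards can be estimated with a one-line model. The sketch below assumes that every mispredicted branch costs a fixed number of stall cycles (one cycle when the decision and the target are both resolved in ID); the branch fraction and misprediction rate are hypothetical workload numbers, not values from the lecture.

```python
def cpi_with_branches(branch_fraction, mispredict_rate, penalty_cycles=1):
    """CPI = 1 + stall cycles per instruction caused by mispredicted branches."""
    return 1.0 + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: 20% branches, 40% of them mispredicted.
print(cpi_with_branches(0.20, 0.40))
# Same workload on a machine that resolves branches later (assumed 3-cycle penalty).
print(cpi_with_branches(0.20, 0.40, penalty_cycles=3))
```

Better prediction lowers `mispredict_rate`; resolving the branch earlier lowers `penalty_cycles`.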
MIPS implementation
RISC-V implementation
Therefore, with hazards, cycles per instruction (CPI) = 1 + stall cycles per instruction
Speedup = Pipeline depth / (1 + stall cycles per instruction)
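The speedup relation can be checked numerically. This is a minimal sketch; the stall values are hypothetical, and the function name `pipeline_speedup` is not from the lecture.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup over the unpipelined machine = depth / (1 + stalls per instruction)."""
    return depth / (1.0 + stalls_per_instruction)

print(pipeline_speedup(5, 0.0))   # 5.0 (ideal: no stalls)
print(pipeline_speedup(5, 0.25))  # 4.0
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction eats into it.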
Deep Pipeline Architecture
Speedup = Pipeline depth / (1 + stall cycles per instruction)
Can the pipeline depth be increased?
The question is how to increase pipeline depth without increasing stall cycles?
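One way to make the question concrete is to ask how many stall cycles a deeper pipeline can afford before it loses its advantage. The sketch below solves new_depth / (1 + s) = old_depth / (1 + old_stalls) for s. It assumes the speedup relation above is the whole story (clock-rate and latch-overhead effects are ignored), and all numbers are hypothetical.

```python
def max_tolerable_stalls(old_depth, old_stalls, new_depth):
    """Stalls per instruction at which the deeper pipeline only breaks even."""
    return new_depth * (1.0 + old_stalls) / old_depth - 1.0

# A 5-stage pipeline with 0.25 stalls per instruction versus a 10-stage design.
print(max_tolerable_stalls(5, 0.25, 10))  # 1.5
```

In this example the 10-stage design wins only if it keeps stalls below 1.5 cycles per instruction.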
Integer Unit
Floating point/Integer multiply
FP Adder
FP/Integer Divider
Assumption:
Production
FP Pipeline Architecture
Latency and initiation interval for each of the pipeline units. Note that in
the case of the FP divider the initiation interval is 25 instead of 1.
Latencies shown in the figure: 0 (= 1 - 1), 1 (= 2 - 1), 3 (= 4 - 1), 6 (= 7 - 1), 24 (= 25 - 1)
The figure is from the 6th edition of the textbook; for your reading you may use the 5th edition,
as the 6th edition has some printing errors.
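Latency and initiation interval answer two different scheduling questions, which the sketch below separates. The unit names and the pairing of each latency with a unit follow the textbook figure as an assumption (the slide text only preserves the numbers), so check them against the figure before relying on them. Latency is taken to mean the number of intervening cycles between a producer and a consumer of its result.

```python
# (latency, initiation interval) per functional unit; pairing is an assumption.
UNITS = {
    "integer ALU": (0, 1),
    "data memory": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (24, 25),
}

def earliest_dependent_issue(unit, issue_cycle):
    """First cycle in which an instruction that uses the result can issue."""
    latency, _ = UNITS[unit]
    return issue_cycle + 1 + latency

def earliest_same_unit_issue(unit, issue_cycle):
    """First cycle in which another operation of the same type can issue."""
    _, interval = UNITS[unit]
    return issue_cycle + interval

print(earliest_dependent_issue("FP add", 1))     # 5
print(earliest_same_unit_issue("FP add", 1))     # 2
print(earliest_same_unit_issue("FP divide", 1))  # 26
```

The FP adder is fully pipelined, so a new add can start every cycle even though each result takes several cycles; the divider is not, so back-to-back divides wait for the whole operation.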
FP Pipeline Architecture: Hazards
Hazards and forwarding in a long-latency pipeline:
Data dependency (the consumer must get the updated data)
Control (unpredictable control path)
Structural hazards (no two stages can access a single resource at the same time)
2) Because instructions have varying running times, the number of register
writes required in a cycle can be more than 1.
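The write-port problem can be shown by computing when each instruction reaches WB. The sketch below assumes every instruction passes through IF, ID, its execution stages, MEM and WB, with execution depths of 7 (multiply), 4 (add) and 1 (load); the issue cycles are a hypothetical example, and `wb_cycle` and `write_conflicts` are illustrative names.

```python
from collections import defaultdict

EX_STAGES = {"fmul.d": 7, "fadd.d": 4, "fld": 1}  # assumed execution depths

def wb_cycle(op, if_cycle):
    """Cycle of WB: IF, then ID, the EX stages, MEM and finally WB."""
    return if_cycle + 1 + EX_STAGES[op] + 2

def write_conflicts(issued):
    """Map each WB cycle with more than one register write to its instructions."""
    by_cycle = defaultdict(list)
    for op, if_cycle in issued:
        by_cycle[wb_cycle(op, if_cycle)].append(op)
    return {cycle: ops for cycle, ops in by_cycle.items() if len(ops) > 1}

# (operation, cycle in which it is fetched)
print(write_conflicts([("fmul.d", 1), ("fadd.d", 4), ("fld", 7)]))
# {11: ['fmul.d', 'fadd.d', 'fld']}
```

Three instructions fetched in different cycles all want the single register write port in cycle 11.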
f4, f0 and f2 are the floating-point registers leading to the RAW dependency.
[Figure annotations: RAW dependency; due to structural hazards; RAW dependency]
FP Pipeline Architecture: Solutions
Solutions to hazards.
Situation in which WAW causes a problem: if fld f2, 0(x2) had been issued
one cycle earlier, fld f2, 0(x2) and fadd.d f2, f4, f6 would cause a WAW hazard.
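The WAW condition can be phrased as a check on write-back order: a later instruction writing the same register must not reach WB before an earlier one. The sketch below assumes execution depths of 4 for fadd.d and 1 for fld, and hypothetical fetch cycles; `is_waw` is an illustrative helper, not the lecture's detection logic.

```python
EX_STAGES = {"fadd.d": 4, "fld": 1}  # assumed execution depths

def wb_cycle(op, if_cycle):
    """Cycle of WB: IF, then ID, the EX stages, MEM and finally WB."""
    return if_cycle + 1 + EX_STAGES[op] + 2

def is_waw(earlier, later):
    """Each argument is (op, dest, if_cycle); `earlier` comes first in program order."""
    op1, dest1, if1 = earlier
    op2, dest2, if2 = later
    return dest1 == dest2 and wb_cycle(op2, if2) < wb_cycle(op1, if1)

fadd = ("fadd.d", "f2", 4)
print(is_waw(fadd, ("fld", "f2", 7)))  # False: both reach WB in cycle 11
print(is_waw(fadd, ("fld", "f2", 6)))  # True: fld would write f2 first
```

In the first case the two instructions collide on the write port instead, which is a structural hazard and not a WAW hazard.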
Structural hazards due to WB and MEM:
- detect the hazard and stall
- detection can be done at the ID stage or at the MEM stage
- stall the issue at the ID stage, or stall before entering MEM or WB
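The ID-stage option can be sketched as a reservation of the write port: at issue time each instruction books the cycle in which it will reach WB and stalls in ID if that cycle is already taken. This is a simplified model under stated assumptions (one write port, WB reached a fixed number of cycles after ID, a plain set in place of any hardware structure); the execution depths and the program are hypothetical.

```python
def issue_with_wb_reservation(program, ex_stages):
    """Return (op, id_cycle, wb_cycle) per instruction, stalling in ID on conflicts."""
    reserved = set()   # WB cycles already booked by earlier instructions
    schedule = []
    id_cycle = 1
    for op in program:
        # WB comes ex_stages[op] + 2 cycles after ID (EX stages, MEM, WB).
        while id_cycle + ex_stages[op] + 2 in reserved:
            id_cycle += 1  # stall this instruction (and those behind it) in ID
        wb = id_cycle + ex_stages[op] + 2
        reserved.add(wb)
        schedule.append((op, id_cycle, wb))
        id_cycle += 1
    return schedule

depths = {"fmul.d": 7, "fadd.d": 4, "fld": 1}  # assumed execution depths
for entry in issue_with_wb_reservation(["fmul.d", "fld", "fld", "fadd.d"], depths):
    print(entry)
```

In this example fadd.d would reach WB in the same cycle as fmul.d, so it is held in ID for one cycle. Stalling at MEM or WB instead keeps ID simpler but means the stall can arise in more than one place.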
Question of interest:
CPU time = Time/Program
= Instructions/Program x Cycles/Instruction x Time/Cycle
Speedup = Performance of new / Performance of old
= CPU time of old / CPU time of new
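The two relations above translate directly into code. This is a minimal sketch with hypothetical numbers (instruction count, CPI values and cycle time are made up for illustration).

```python
def cpu_time(instructions, cpi, cycle_time_ns):
    """CPU time = instructions/program x cycles/instruction x time/cycle."""
    return instructions * cpi * cycle_time_ns

def speedup(old_time, new_time):
    """Speedup = CPU time of old / CPU time of new."""
    return old_time / new_time

old = cpu_time(1_000_000, cpi=1.5, cycle_time_ns=1.0)
new = cpu_time(1_000_000, cpi=1.2, cycle_time_ns=1.0)
print(speedup(old, new))
```

Here only the CPI changes, so the speedup is simply the ratio of the two CPI values (about 1.25).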
In the 1980s (decade of pipelining):
CPI: from 5.0 down to 1.15
In the 1990s (decade of superscalar):
CPI: from 1.15 down to 0.5 (best case)
In the 2000s (decade of multicore):
Focus on thread-level parallelism, CPI close to 0.33 (best case)
Limits of Pipeline
Amdahl's Law
Speedup = P1/P2
P1: Performance for the entire task using the enhancement
P2: Performance for the entire task without the enhancement
[Figure: number of processors (1 to N) versus time; the time axis is split into h and 1-h, and into f and 1-f]
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f
Overall speedup: Speedup = 1 / ((1 - f) + f/v)
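The overall speedup formula is easy to evaluate for concrete values. This is a minimal sketch; the values of f and v are hypothetical.

```python
def amdahl_speedup(f, v):
    """Overall speedup when a fraction f of the time is sped up by a factor v."""
    return 1.0 / ((1.0 - f) + f / v)

# 80% of the time is vectorizable and runs 4 times faster.
print(amdahl_speedup(0.8, 4))
```

Although 80% of the work runs four times faster, the overall speedup is only about 2.5, because the remaining 20% is untouched.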
Limits of Pipeline
Amdahl's Law
Sequential bottleneck
Even if v is infinite, the performance is limited by the non-vectorizable code, i.e. 1 - f:
lim (v → ∞) 1 / ((1 - f) + f/v) = 1 / (1 - f)
Limits of Pipeline
Pipeline Performance Model:
[Figure: pipeline depth (from 1 up to the full depth) versus time, with the time axis split into 1-g and g]
g = fraction of time pipeline is filled
1-g = fraction of time pipeline is not filled (stalled)
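One common way to turn these two fractions into a speedup is to reuse the form of Amdahl's law: the pipeline delivers its full depth for a fraction g of the time and behaves like an unpipelined machine for the rest. The formula Speedup = 1 / ((1 - g) + g/depth) is not written out on the slide and is an assumption here, as are the numbers.

```python
def pipelined_speedup(g, depth):
    """Speedup of a pipeline of the given depth that is filled a fraction g of the time."""
    return 1.0 / ((1.0 - g) + g / depth)

for g in (1.0, 0.9, 0.5):
    print(g, round(pipelined_speedup(g, depth=6), 2))
```

Under this model even a small stalled fraction costs a large share of the ideal speedup, which is why stall reduction matters more as pipelines get deeper.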
Beyond Scalar Pipeline
Typical Range
Limits of Pipeline
Speedup(N) = 1 / ((1 - f) + f/N)
f = fraction vectorizable
N = number of processors
The challenge of Amdahl's Law
Look at the 90% and 95% cases
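Evaluating Speedup(N) = 1 / ((1 - f) + f/N) for f = 90% and f = 95% shows how quickly adding processors stops paying off. This is a minimal sketch; the processor counts are chosen only for illustration.

```python
def speedup(f, n):
    """Amdahl's law for n processors and a parallelizable fraction f."""
    return 1.0 / ((1.0 - f) + f / n)

for f in (0.90, 0.95):
    row = [round(speedup(f, n), 1) for n in (10, 100, 1000)]
    limit = round(1.0 / (1.0 - f), 1)  # bound as n grows without limit
    print(f, row, "limit:", limit)
```

With 90% parallel code the speedup can never exceed 10, and with 95% it can never exceed 20, no matter how many processors are used.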
Limits of Pipeline
[Figure: three pipeline timing diagrams showing successive instructions passing through the IF, DE, EX and WB stages; horizontal axis: time in cycles (of baseline machine)]
The compiler:
[Figure: software-pipelined performance; labels: startup overhead, loop time, iteration]
Next Lecture
Pipeline to continue...