CS 162 Computer Architecture Lecture 3: Pipelining Contd.: Instructor: L.N. Bhuyan
1 1999 ©UCB
Single Cycle Datapath (From Ch 5)
[Figure: single-cycle datapath. Blocks: PC, Imem, Regs, sign extend (15:0), shift left 2, adders, ALU, Dmem. Instruction fields: 25:21 (Read Reg1), 20:16 (Read Reg2), 15:11 (write register). Control signals: PCSrc, RegDst, RegWrite, ALUsrc, ALUcon, MemRead, MemWrite, MemToReg.]
Pipelined Datapath (with Pipeline Regs)
(6.2)
[Figure: pipelined datapath with stages Fetch, Decode, Execute, Memory, Write Back. Blocks: PC, Imem, Regs, sign extend (16 to 32 bits), shift left 2, adders, ALU, Dmem. Pipeline register widths, left to right: 64 bits, 133 bits, 102 bits, 69 bits.]
Pipelined Control
(6.3)
• Start with single-cycle controller
• Group control lines by pipeline stage needed
• Extend pipeline registers with control bits
Control lines grouped by stage:
EX:  RegDst, ALUOp, ALUSrc
Mem: Branch, MemRead, MemWrite
WB:  MemToReg, RegWrite
[Figure: pipelined datapath with control. The Control unit generates the EX, M, and WB groups during decode. ID/EX carries EX, M, WB; EX/MEM carries M, WB; MEM/WB carries WB. Instruction [15-0] feeds sign extend and ALU control; Instruction [20-16] and Instruction [15-11] feed the RegDst mux.]
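The way control bits travel with an instruction can be sketched in Python. This is a minimal illustration, not the lecture's implementation: the dictionary layout and the `advance` helper are hypothetical, and the `lw` control values are assumed from the usual single-cycle MIPS control table. Only the signal names and their grouping come from the slide.

```python
# Control lines grouped by the stage that consumes them (names from the slide).
CONTROL_GROUPS = {
    "EX": ("RegDst", "ALUOp", "ALUSrc"),
    "M": ("Branch", "MemRead", "MemWrite"),
    "WB": ("MemToReg", "RegWrite"),
}

# Groups still carried by each pipeline register (hypothetical layout).
CARRIED = {
    "ID/EX": ("EX", "M", "WB"),
    "EX/MEM": ("M", "WB"),
    "MEM/WB": ("WB",),
}


def advance(control, next_reg):
    """Copy only the groups that later stages still need into next_reg."""
    return {group: control[group] for group in CARRIED[next_reg]}


# Assumed control values for lw (not stated on the slide).
lw = {
    "EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
    "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
    "WB": {"MemToReg": 1, "RegWrite": 1},
}

id_ex = advance(lw, "ID/EX")
ex_mem = advance(id_ex, "EX/MEM")
mem_wb = advance(ex_mem, "MEM/WB")
print(sorted(id_ex), sorted(ex_mem), sorted(mem_wb))
```

Each pipeline register drops the group that was just consumed, so the registers get narrower toward the write-back end.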
Recap
° If all pipeline stages can be kept busy,
up to one instruction can retire (complete)
per clock cycle (thereby
achieving single-cycle throughput)
° The pipeline paradox (for MIPS): any
instruction still takes 5 cycles to
execute (even though one instruction
can retire per cycle)
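The paradox can be checked with a little arithmetic. The sketch below assumes an ideal five-stage pipeline with no stalls; the function names are illustrative only.

```python
def total_cycles(n_instructions, n_stages=5):
    """Cycles to finish n instructions in an ideal pipeline with no stalls."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one retires
    # one cycle after its predecessor.
    return n_stages + (n_instructions - 1)


def cycles_per_instruction(n_instructions, n_stages=5):
    """Average cycles per retired instruction over the whole run."""
    return total_cycles(n_instructions, n_stages) / n_instructions


print(total_cycles(1))               # 5
print(total_cycles(1000))            # 1004
print(cycles_per_instruction(1000))  # 1.004
```

Latency per instruction stays at 5 cycles, while the average cost per instruction approaches 1 cycle as the run gets longer.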
Problems for Pipelining
° Hazards prevent next instruction from
executing during its designated clock
cycle, limiting speedup
• Structural hazards: HW cannot support
this combination of instructions (single
memory for instruction and data)
• Data hazards: Instruction depends on
result of prior instruction still in the
pipeline
• Control hazards: conditional branches &
other instructions may stall the pipeline
delaying later instructions
Single Memory is a Structural
Hazard
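[Figure: pipeline diagram, time (clock cycles) across, instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through M (instruction fetch), Reg, ALU, M (data access), Reg. The data access of Load and the instruction fetch of Instr 3 fall in the same clock cycle.]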
• Can’t read same memory twice in same clock cycle
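A short Python sketch can locate these collisions. It assumes instruction i is fetched in cycle i and reaches its memory stage in cycle i + 3 (five stages, no stalls, one single-ported memory); the function name and input format are hypothetical.

```python
def memory_conflicts(uses_data_memory):
    """Return (accessing_index, fetching_index) pairs that collide.

    uses_data_memory[i] is True when instruction i is a load or store.
    Instruction i accesses data memory in cycle i + 3, which is also the
    cycle in which instruction i + 3 is fetched.
    """
    conflicts = []
    n = len(uses_data_memory)
    for i, uses_mem in enumerate(uses_data_memory):
        fetch_index = i + 3
        if uses_mem and fetch_index < n:
            conflicts.append((i, fetch_index))
    return conflicts


# Load, Instr 1, Instr 2, Instr 3, Instr 4, as in the figure.
print(memory_conflicts([True, False, False, False, False]))  # [(0, 3)]
```

The single pair reported here is the Load colliding with the fetch of Instr 3.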
EX: MIPS multicycle datapath:
Structural Hazard in Memory
Structural Hazards limit
performance
° Example: if there are 1.3 memory accesses per
instruction (30% of instructions
execute loads and stores)
and only one memory access per cycle
then
• Average CPI ≥ 1.3
• Otherwise the datapath resource is more than
100% utilized
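The bound follows from dividing demand by supply. The sketch below is only an illustration; it assumes an ideal CPI of 1 when memory is not the bottleneck, and the function name is hypothetical.

```python
def min_cpi(mem_accesses_per_instr, mem_accesses_per_cycle=1.0):
    """Lower bound on CPI imposed by memory bandwidth alone.

    Assumes the pipeline could otherwise reach CPI = 1.
    """
    return max(1.0, mem_accesses_per_instr / mem_accesses_per_cycle)


print(min_cpi(1.3))       # 1.3 with one access per cycle
print(min_cpi(1.3, 2.0))  # 1.0 with two accesses per cycle
```

At CPI = 1 the single memory port would have to serve 1.3 accesses per cycle, which is 130% utilization, so the CPI must rise to at least 1.3.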
Example: Dual-port vs. Single-port
° Machine A: Dual ported memory
° Machine B: Single ported memory, but its pipelined implementation
has a 1.05 times faster clock rate
° Ideal CPI = 1 for both
° Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
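The same arithmetic can be checked in Python. This sketch assumes an ideal CPI of 1 as stated on the slide; the pipeline depth of 5 is an arbitrary choice for illustration, since it cancels in the ratio.

```python
def speedup(depth, stall_cycles_per_instr, clock_ratio=1.0):
    """Pipeline speedup with ideal CPI = 1.

    clock_ratio is clock_unpipe / clock_pipe expressed as cycle times,
    so 1.05 means the pipelined clock is 1.05 times faster.
    """
    return depth / (1.0 + stall_cycles_per_instr) * clock_ratio


depth = 5  # assumed; any positive depth gives the same ratio
speedup_a = speedup(depth, 0.0)             # dual-ported memory, no stalls
speedup_b = speedup(depth, 0.4 * 1, 1.05)   # 40% loads, 1 stall each

print(round(speedup_a / depth, 2))      # 1.0
print(round(speedup_b / depth, 2))      # 0.75
print(round(speedup_a / speedup_b, 2))  # 1.33
```

Machine A comes out about 1.33 times faster: the 5% clock advantage of machine B does not make up for a stall on 40% of its instructions.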
Data Hazard on Register $1
(6.4)
add $1, $2, $3
or $8, $1, $9
Data Hazard
Solution:
• “Forward” result from one stage to another
[Figure: pipeline diagram, time (clock cycles) across, instruction order down; stages IF, ID/RF, EX, MEM, WB. Instructions: add $1,$2,$3; sub $4,$1,$3; and $6,$1,$7; or $8,$1,$9; xor $10,$1,$11. The result of add is forwarded to the later instructions that read $1.]
• “or” is OK if the register file is implemented properly (written in the first half of the clock cycle, read in the second half)
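How each dependent instruction obtains $1 depends on its distance from the add. The Python sketch below classifies the four consumers; it assumes a five-stage pipeline whose register file writes in the first half of a cycle and reads in the second half, and the function name and strings are illustrative.

```python
def forwarding_source(distance):
    """Where a consumer `distance` instructions after the producer gets the value."""
    if distance == 1:
        return "forward from EX/MEM"
    if distance == 2:
        return "forward from MEM/WB"
    if distance == 3:
        return "register file (write before read in the same cycle)"
    return "register file (no hazard)"


program = [
    "add $1,$2,$3",
    "sub $4,$1,$3",
    "and $6,$1,$7",
    "or $8,$1,$9",
    "xor $10,$1,$11",
]

# Every instruction after the add reads $1, so the distance is its position.
for distance, instr in enumerate(program[1:], start=1):
    print(f"{instr:16} {forwarding_source(distance)}")
```

Only the first two consumers need a forwarding path; the third relies on the register file, and the fourth reads $1 after it has been written.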
Hazard Detection for Forwarding
° A hazard must be detected just before execution so that
in case of hazard, the data can be forwarded to the
input of the ALU.
° It can be detected when a source register (Rs or Rt or
both) of the instruction at the EX stage is equal to the
destination register (Rd) of an instruction in the
pipeline (either in MEM or WB stage)
° Compare the Rs and Rt register numbers in the ID/EX
register with Rd in the EX/MEM and MEM/WB registers =>
Need to carry the Rs, Rt, Rd numbers to the ID/EX register
from the IF/ID register (only Rd was carried before)
° If they match, forward the data to the input of the ALU
through the multiplexor.
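The comparison described above can be sketched as a small forwarding unit in Python. This is a hedged illustration, not the lecture's circuit: the `Latch` class and `forward_select` function are hypothetical, and the RegWrite and register-0 checks are assumptions taken from standard MIPS practice rather than from the slide.

```python
from dataclasses import dataclass


@dataclass
class Latch:
    """The fields of EX/MEM or MEM/WB that the forwarding unit inspects."""
    reg_write: bool = False  # RegWrite control bit carried in the register
    rd: int = 0              # destination register number


def forward_select(src, ex_mem, mem_wb):
    """Pick the ALU input source for one source register number (Rs or Rt)."""
    # Check EX/MEM first: it holds the most recent result.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "EX/MEM"
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "MEM/WB"
    return "REG"  # no match: use the value read from the register file


# sub $4,$1,$3 is in EX while add $1,$2,$3 is in MEM.
ex_mem = Latch(reg_write=True, rd=1)
mem_wb = Latch()
print(forward_select(1, ex_mem, mem_wb))  # EX/MEM
print(forward_select(3, ex_mem, mem_wb))  # REG
```

The function is called once for Rs and once for Rt, and each result drives the multiplexor on the corresponding ALU input.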
[Figure: lw $1,0($2) followed by sub $4,$1,$3; stages IF, ID/RF, EX, MEM, WB. The loaded value leaves DM at the end of the cycle in which sub already needs it at the ALU.]
[Figure: lw $1,0($2) followed by sub $4,$1,$6; and $6,$1,$7; or $8,$1,$9; stages IF, ID/RF, EX, MEM, WB. A bubble delays sub and each following instruction by one cycle.]
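The condition for inserting the bubble can be written as a one-line check. The sketch below assumes the usual load-use test (the instruction in EX is a load whose destination Rt matches a source of the instruction in decode); the function and parameter names are hypothetical and are not given in the slides.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in decode reads the register being loaded.

    id_ex_mem_read: MemRead bit of the instruction in EX (set for lw).
    id_ex_rt: destination register number of that load.
    if_id_rs, if_id_rt: source register numbers of the instruction in decode.
    """
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


# lw $1,0($2) in EX, sub $4,$1,$6 in decode: stall one cycle.
print(must_stall(True, 1, 1, 6))   # True
# Same register numbers, but the instruction in EX is not a load.
print(must_stall(False, 1, 1, 6))  # False
```

After the one-cycle bubble, the loaded value can be forwarded from MEM/WB to the ALU input.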
Compiler Schemes to Improve Load Delay
° The compiler detects data dependencies and inserts
nop instructions until the data is available
sub $2, $1, $3
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
° The compiler finds independent instructions to
fill the delay slots
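The nop-insertion scheme can be sketched as a small pass over the instruction list. This is a simplified illustration, not a real compiler pass: the tuple format, the `insert_nops` name, and the `delay` parameter (how many following instructions must not read a fresh result) are all assumptions.

```python
NOP = ("nop", None, ())


def insert_nops(program, delay=1):
    """Pad with nops so no instruction reads a register written by any of the
    `delay` instructions immediately before it.

    program: list of (text, dest_register, source_registers) tuples.
    """
    out = []
    for instr in program:
        sources = instr[2]
        recent = out[-delay:] if delay > 0 else []
        needed = 0
        # back = 1 is the instruction directly before this one.
        for back, prev in enumerate(reversed(recent), start=1):
            if prev[1] is not None and prev[1] in sources:
                needed = max(needed, delay - back + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


program = [
    ("sub $2,$1,$3", 2, (1, 3)),
    ("and $12,$2,$5", 12, (2, 5)),
    ("or $13,$6,$2", 13, (6, 2)),
]
for text, _, _ in insert_nops(program):
    print(text)
```

With `delay=1` this reproduces the single nop between sub and and shown in the listing above; a machine that needs more separation would use a larger `delay`.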
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:          Fast code:
LW  Rb,b            LW  Rb,b
LW  Rc,c            LW  Rc,c
ADD Ra,Rb,Rc        LW  Re,e
SW  a,Ra            ADD Ra,Rb,Rc
LW  Re,e            LW  Rf,f
LW  Rf,f            SW  a,Ra
SUB Rd,Re,Rf        SUB Rd,Re,Rf
SW  d,Rd            SW  d,Rd
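The benefit of the reordering can be counted with a short Python sketch. It assumes one stall cycle whenever an instruction immediately follows a load and reads the loaded register, with forwarding covering every other dependence; the tuple encoding and function name are hypothetical.

```python
def load_use_stalls(program):
    """Count instructions that directly follow a load and read its result.

    program: list of (opcode, dest_register, source_registers) tuples.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

Both sequences contain the same eight instructions; moving a load between each load and its consumer removes the two stalls under this model.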