Pipe 4
Basics
° Pipelines pass control information down the pipe just
as data moves down the pipe
° Forwarding/Stalls handled by local control
° Hazards limit performance
• Structural: need more HW resources
• Data: need forwarding, compiler scheduling
• Control: early evaluation of branch condition & PC, delayed branch, prediction
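The data-hazard categories can be illustrated with a minimal sketch that classifies the register dependences between two instructions in program order. The `Instr` tuple and the `data_hazards` function are hypothetical names introduced here, not part of the lecture. Whether a dependence actually causes a stall depends on the pipeline organization, which this sketch does not model.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def data_hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the dependence types from `first` to a later `second`."""
    hazards = set()
    # RAW: the later instruction reads what the earlier one writes.
    if first.dest is not None and first.dest in second.srcs:
        hazards.add("RAW")
    # WAR: the later instruction writes what the earlier one reads.
    if second.dest is not None and second.dest in first.srcs:
        hazards.add("WAR")
    # WAW: both instructions write the same register.
    if first.dest is not None and first.dest == second.dest:
        hazards.add("WAW")
    return hazards


add = Instr("add", "r1", ("r2", "r3"))
sub = Instr("sub", "r4", ("r1", "r3"))
print(data_hazards(add, sub))  # {'RAW'}
```

Forwarding addresses the RAW case; WAW and WAR only matter when instructions can write or read out of program order.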
[Figure: WAW Data Hazard. Pipeline timing diagrams in which instructions with stages IF DCD OF Ex Mem overlap instructions with stages IF DCD EX Mem WB. Additional label: Datapath Output.]
° Today’s Topics:
• Recap last lecture
• Review MIPS R3000 pipeline
• Advanced Pipelining
• SuperScalar
Resource Usage
[Figure: pipeline resource usage by clock phase (phi2 marked). Resources shown: TLB, I-cache, RF, ALU, D-Cache, TLB, WB.]
[Figure: pipeline timing diagram. Instructions are listed in program order, and each passes through Im, Reg, ALU, Dm, Reg.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
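Every instruction after the `add` in the sequence above reads `r1`, so the distance between producer and consumer decides how the value reaches the consumer. The sketch below computes those distances. It assumes a classic five-stage pipeline in which the register file is written in the first half of a cycle and read in the second half, so that only consumers one or two instructions behind the producer need forwarding; that assumption, and the helper names `parse`, `raw_distances`, and `needs_forwarding`, are illustrative and not taken from the slides.

```python
PROGRAM = [
    "add r1,r2,r3",
    "sub r4,r1,r3",
    "and r6,r1,r7",
    "or r8,r1,r9",
    "xor r10,r1,r11",
]


def parse(line):
    """Split 'op rd,rs,rt' into (op, dest, [sources])."""
    op, regs = line.split(None, 1)
    dest, *srcs = [r.strip() for r in regs.split(",")]
    return op, dest, srcs


def raw_distances(program):
    """List (producer, consumer, register, distance) for each RAW dependence."""
    parsed = [parse(line) for line in program]
    found = []
    for j, (op, _dest, srcs) in enumerate(parsed):
        for src in srcs:
            # Search backwards for the most recent writer of this source.
            for i in range(j - 1, -1, -1):
                if parsed[i][1] == src:
                    found.append((parsed[i][0], op, src, j - i))
                    break
    return found


def needs_forwarding(distance):
    """Assumed five-stage pipeline with split-cycle register file access."""
    return distance <= 2


for producer, consumer, reg, dist in raw_distances(PROGRAM):
    how = "forward" if needs_forwarding(dist) else "register file"
    print(f"{producer} -> {consumer} on {reg}: distance {dist}, via {how}")
```

Under these assumptions `sub` and `and` take the value from the forwarding paths, while `or` and `xor` read it from the register file.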
[Figure: superscalar datapath. Labeled elements: PC, adder (+4), constant 40000040, instruction memory, registers, two ALUs, data memory (write data), multiplexers.]
Superscalar additions: 32 more bits from instruction memory; 2 more read ports
and 1 more write port for the register file; 1 more ALU (top ALU for address
calculation, bottom ALU for everything else).
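With one ALU reserved for address calculation and the other for everything else, two adjacent instructions can issue together only if exactly one of them is a memory operation and the second does not depend on the first. The following is a minimal sketch of that pairing check; the `MEM_OPS` set, the tuple layout, and the `can_dual_issue` name are assumptions made for illustration, and real issue logic has more cases (branches, load-use delays).

```python
# Hypothetical opcode set for the memory (address-calculation) slot.
MEM_OPS = {"lw", "sw"}


def can_dual_issue(a, b):
    """a, b: (op, dest, srcs) tuples in program order; dest may be None."""
    # Exactly one instruction may use the address-calculation ALU.
    one_mem = (a[0] in MEM_OPS) != (b[0] in MEM_OPS)
    # The second instruction must not read or overwrite the first's result.
    independent = a[1] is None or (a[1] not in b[2] and a[1] != b[1])
    return one_mem and independent


add = ("add", "r1", ("r2", "r3"))
print(can_dual_issue(add, ("lw", "r4", ("r5",))))          # True
print(can_dual_issue(add, ("lw", "r4", ("r1",))))          # False
print(can_dual_issue(add, ("sub", "r6", ("r7", "r8"))))    # False
```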
cs 152 L1 5 .12 DAP Fa97, U.CB
Superscalar Characteristics
[Figure: dynamically scheduled pipeline. Functional units (Integer, Integer, …, Floating point, Load/Store) are labeled "Out-of-order execute"; the commit unit is labeled "In-order commit".]
[Figure: 2-bit branch prediction state diagram. Four states: two Predict Taken and two Predict Not Taken; transitions are labeled T (taken) and NT (not taken).]
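A 2-bit predictor of this kind can be sketched as a saturating counter. This is one common variant and is an assumption here: the exact transitions of the slide's diagram are not recoverable from the text. The `TwoBitPredictor` class and the `mispredictions` helper are hypothetical names.

```python
class TwoBitPredictor:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outcomes, start=3):
    """Count wrong predictions over a sequence of branch outcomes."""
    predictor = TwoBitPredictor(start)
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop branch taken 9 times and then not taken, executed 3 times over.
loop_branch = ([True] * 9 + [False]) * 3
print(mispredictions(loop_branch))  # 3
```

Because a single not-taken outcome only moves the counter from 3 to 2, the predictor still predicts taken when the loop is re-entered, so each loop execution costs one misprediction rather than two.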
[Figure: PowerPC 604 organization. A decode/dispatch unit feeds six reservation stations, one in front of each functional unit: Branch, Integer, Integer, Complex integer, Floating point, Load/store (with Store and Load paths). Also shown: commit unit and reorder buffer.]
Pentium Pro (PPro): a central reservation station serves any functional unit,
with one bus shared by a branch unit and an integer unit
Dynamic Scheduling in PowerPC 604 and Pentium Pro
[Figure: relative performance (y-axis, 0.0 to 2.5) versus pipeline depth (x-axis: 1, 2, 4, 8, 16).]
Limits to Multi-Issue Machines
° Need about Pipeline Depth x No. Functional Units
independent instructions to keep the machine busy
° Difficulties in building HW:
° Duplicate FUs to get parallel execution
° Increase ports to Register File
° Increase ports to memory
° Decoding in SS and its impact on clock rate, pipeline depth
° Limitations specific to either SS or VLIW
implementation
• Decode issue in SS
• VLIW code size: unrolled loops + wasted fields in
VLIW
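The rule of thumb in the first bullet is a simple product, and a few sample values show how quickly it grows. The configurations below are hypothetical and chosen only for illustration; `independent_instrs_needed` is a name introduced here.

```python
def independent_instrs_needed(pipeline_depth, functional_units):
    """Rough count of independent instructions needed in flight."""
    return pipeline_depth * functional_units


# Hypothetical (pipeline depth, number of functional units) pairs.
for depth, units in [(5, 1), (5, 4), (8, 6)]:
    needed = independent_instrs_needed(depth, units)
    print(f"depth {depth} x {units} FUs -> about {needed} independent instrs")
```

A single-issue five-stage pipeline needs about 5 independent instructions, while the hypothetical 8-deep machine with 6 functional units needs about 48, which is far more than a typical basic block supplies.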
° Pipelining
- Limitation: issue rate, FU stalls, FU depth
° Super-pipeline
- Issue one instruction per (fast) cycle
- ALU takes multiple cycles
- Limitation: clock skew, FU stalls, FU depth
° VLIW (“EPIC”)
- Each instruction specifies multiple scalar operations
- Compiler determines parallelism
- Limitation: packing
[Figure: overlapped pipeline timing diagrams (stages IF D Ex M W) for each technique; the VLIW instruction shows a shared IF D followed by several parallel Ex M W lanes.]
[Figure: pipelined datapath with hazard detection and exception support. Labeled elements: hazard detection unit; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; control fields EX, M, WB; Cause and Except PC registers; constant 40000040; PC, adder (+4), instruction memory, registers, sign extend (16 to 32), shift left 2, comparator (=), ALU, ALU control, data memory, multiplexers; control signals RegWrite, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp.]
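[Figure: clock rate (y-axis, slower to faster) versus instruction throughput (x-axis, slower to faster; instructions per clock cycle or 1/CPI). Multicycle datapath (section 5.4) and pipelined datapath (Chapter 6) sit at the faster clock rate, with the pipelined datapath further right; single-cycle datapath (section 5.3) sits at the slower clock rate.]
[Figure: hardware (y-axis, shared to specialized) versus clock cycles of latency for an instruction (x-axis, 1 to several). Single-cycle datapath (section 5.3) and pipelined datapath (Chapter 6) use specialized hardware, with the pipelined datapath further right; multicycle datapath (section 5.4) uses shared hardware.]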
[Figure: timeline of architectural ideas, from 1961 to today. Recoverable labels:
1961: Inst. Pipelining, Inst. Buffering (Stretch, IBM 7030), 100x IBM 704; Microprogramming
1966: 25x basic model
1967: Virtual Memory (Multics, GE-645, IBM 360/67, ...); TLB
Dynamic Inst. Scheduling with extensive pipelining (IBM 360/91)
Load/Store ISA (CDC 6600, 7600, Cray-1, ...)
Cache (IBM 360/85, ...); vector proc.
80's RISC pipelines (MIPS, SPARC, IBM RS6000)
early 90's RISC Superscalars (IBM Power 1 and PowerPC)
Today
Machine parameters shown: 60ns, hardwired, 8x16b bus, 780ns mem; 80ns, 2Kb Ctrl. St, 4x16b bus, 960ns mem, 32KB cache with 60-160ns.]