Pipelining
Organization & Architecture
Traditional Concept: Laundry System
[Figure: sequential laundry timeline, 6 PM to midnight. Four loads (A to D) run back to back in task order; each load takes 30, 40, and 20 minutes.]
[Figure: pipelined laundry timeline. The loads overlap, giving stage times of 30, 40, 40, 40, 40, and 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• Stall for dependences
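The laundry numbers can be checked with a short calculation. This is a minimal sketch, not part of the original slides: the function names are hypothetical, and the stage labels (wash, dry, fold) are an assumption, since the timeline only gives the durations 30, 40, and 20 minutes.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Loads overlap. After the first load, one more load completes
    every max(stage_minutes) minutes: the slowest stage sets the rate."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)


# Stage durations in minutes, from the timeline (labels assumed: wash, dry, fold).
stages = [30, 40, 20]
loads = 4

seq = sequential_minutes(stages, loads)   # 360 minutes = 6 hours
pipe = pipelined_minutes(stages, loads)   # 210 minutes = 3.5 hours
speedup = seq / pipe                      # about 1.7, well below the 3-stage ideal

print(f"sequential: {seq / 60} h, pipelined: {pipe / 60} h, speedup: {speedup:.2f}")
```

The speedup falls short of 3 for two reasons listed above: the stages are unbalanced (the 40-minute stage sets the pace), and with only 4 loads the fill and drain time is a large share of the total.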
Pipelining
• Pipelining is a general-purpose efficiency technique
– It is not specific to processors
[Figure: datapath with instruction memory, register file, ALU, data memory, and sign-extend unit; control signals RegDst, RegWrite, ALUSrc, ALUOp, MemRead, MemWrite, and MemToReg]
Execute (EX)
The third step, Execute (EX), computes the effective memory address from the source register
and the instruction’s constant field.
[Figure: datapath for the Execute (EX) step]
Memory (MEM)
The Memory (MEM) step reads the data memory at the address computed by the ALU.
[Figure: datapath for the Memory (MEM) step]
Writeback (WB)
• Finally, in the Writeback (WB) step, the memory value is stored into the destination register.
[Figure: datapath for the Writeback (WB) step]
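The EX, MEM, and WB steps described above can be traced in a few lines. This is a sketch only, assuming a MIPS-style load word (lw rt, imm(rs)); the function names, the list used for registers, and the dictionary used for data memory are hypothetical stand-ins for the hardware in the figures.

```python
def sign_extend_16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute_lw(regs, data_memory, rs, rt, imm16):
    """Walk a load word (lw rt, imm(rs)) through EX, MEM, and WB."""
    # EX: the ALU adds the source register and the sign-extended constant.
    address = regs[rs] + sign_extend_16(imm16)
    # MEM: read the data memory at the address computed by the ALU.
    value = data_memory[address]
    # WB: store the memory value into the destination register.
    regs[rt] = value
    return address, value


regs = [0] * 32
regs[8] = 100                 # base address held in the source register
data_memory = {96: 42}        # one word of data memory (hypothetical contents)

# 0xFFFC is -4 as a 16-bit constant, so the effective address is 100 - 4 = 96.
address, value = execute_lw(regs, data_memory, rs=8, rt=9, imm16=0xFFFC)
```

Each step uses a different unit (ALU, data memory, register file), which is the observation the next slides build on.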
A bunch of lazy functional units
• Notice that each execution step uses a different functional unit.
• In other words, the main units are idle for most of the 8ns cycle!
– The instruction RAM is used for just 2ns at the start of the cycle.
– Registers are read once in ID (1ns), and written once in WB (1ns).
– The ALU is used for 2ns near the middle of the cycle.
– Reading the data memory only takes 2ns as well.
• That’s a lot of hardware sitting around doing nothing.
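To put a number on "idle for most of the cycle", the busy times from the bullets above can be divided by the 8ns cycle. A minimal sketch; the dictionary and variable names are illustrative only.

```python
CYCLE_NS = 8  # clock period quoted above

# Busy time per functional unit during one instruction, in ns (from the bullets).
busy_ns = {
    "instruction memory": 2,
    "register file": 1 + 1,   # read once in ID, written once in WB
    "ALU": 2,
    "data memory": 2,
}

# Fraction of the cycle during which each unit is doing useful work.
utilization = {unit: ns / CYCLE_NS for unit, ns in busy_ns.items()}

for unit, fraction in utilization.items():
    print(f"{unit}: busy {fraction:.0%}, idle {1 - fraction:.0%}")
```

Under these figures every unit is busy for only a quarter of the cycle.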
Putting those slackers to work
• We shouldn’t have to wait for the entire instruction to complete before we
can re-use the functional units.
• For example, the instruction memory is free in the Instruction Decode step
as shown below, so...
[Figure: datapath during Instruction Decode (ID), with the instruction memory marked idle]
Decoding and fetching together
• Why don’t we go ahead and fetch the next instruction while we’re
decoding the first one?
[Figure: datapath fetching the next instruction while the first is decoded]
Executing, decoding and fetching
• Similarly, once the first instruction enters its Execute stage, we can go
ahead and decode the second instruction.
• But now the instruction memory is free again, so we can fetch the third
instruction!
[Figure: datapath with three instructions in flight: fetch 3rd, decode 2nd, execute 1st]
Making Pipelining Work
• We’ll make our pipeline 5 stages long, to handle load instructions as they were handled in the multi-cycle implementation
– Stages are: IF, ID, EX, MEM, and WB
• We want to support executing 5 instructions simultaneously: one in each stage.
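The overlap can be tabulated cycle by cycle. The sketch below assumes an ideal pipeline (one instruction enters per cycle, no stalls); pipeline_schedule is a hypothetical helper, not something defined in these slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return schedule[cycle][stage] = instruction index (0-based) or None.

    Assumes an ideal pipeline: one instruction enters per cycle, no stalls.
    """
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, name in enumerate(stages):
            instr = cycle - depth  # the instruction that entered `depth` cycles ago
            row[name] = instr if 0 <= instr < num_instructions else None
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(5)

# Print one row per cycle; "-" marks an empty stage during fill or drain.
for cycle, row in enumerate(schedule):
    cells = ["-" if row[name] is None else f"I{row[name]}" for name in STAGES]
    print(cycle, " ".join(f"{c:>3}" for c in cells))
```

With five instructions, the row for cycle 4 is the only one with every stage occupied; the rows before it show the pipeline filling and the rows after it show it draining.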
Break datapath into 5 stages
• Each stage has its own functional units.
• Each stage can execute in 2ns, just like the multi-cycle implementation.
[Figure: datapath divided into five stages: IF, ID, EX, MEM, and WB]
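A rough timing comparison follows from the two figures quoted in these slides: an 8ns cycle for the unpipelined datapath and 2ns per pipeline stage. This sketch assumes an ideal pipeline with no stalls; the function names are hypothetical.

```python
def single_cycle_ns(num_instructions, cycle_ns=8):
    """One long clock cycle per instruction."""
    return num_instructions * cycle_ns


def pipelined_ns(num_instructions, num_stages=5, stage_ns=2):
    """Fill the pipeline once, then finish one instruction per stage time.

    Assumes an ideal pipeline with no stalls.
    """
    return (num_stages + num_instructions - 1) * stage_ns


for n in (1, 5, 1000):
    base = single_cycle_ns(n)
    pipe = pipelined_ns(n)
    print(f"{n} instructions: {base} ns vs {pipe} ns, speedup {base / pipe:.2f}")
```

A single instruction is slower when pipelined (10ns against 8ns), so latency does not improve. For long instruction streams the speedup approaches 8 / 2 = 4 rather than 5, because ID and WB need only 1ns each but are given a full 2ns stage.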