CSCE 5610 Computer System Architecture: Instruction Level Parallelism
[Figure: laundry timeline for four loads; each load's steps take 30, 40, and 20 minutes]
— One drier (takes 40 minutes)
[Figure: single-cycle datapath diagrams showing the instruction memory, register file, ALU, data memory, and sign-extend unit, with control signals RegWrite, MemWrite, MemRead, MemToReg, RegDst, ALUSrc, and ALUOp]
Writeback (WB)
• Finally, in the Writeback (WB) step, the memory value is stored into the destination register.

A Bunch of Lazy Functional Units
• Notice that each execution step uses a different functional unit.
• In other words, the main units are idle for most of the 8ns cycle!
— The instruction RAM is used for just 2ns at the start of the cycle.
— Registers are read once in ID (1ns), and written once in WB (1ns).
— The ALU is used for 2ns near the middle of the cycle.
— Reading the data memory only takes 2ns as well.
• That’s a lot of hardware sitting around doing nothing.
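To put numbers on the idle time, here is a minimal sketch that tallies each unit's busy time against the 8ns cycle, using the figures from the bullets above. The dictionary and the `utilization` helper are illustrative names, not part of the lecture.

```python
# Busy time (ns) of each main unit during one 8 ns single-cycle instruction,
# taken from the bullets above. All names here are illustrative.
CYCLE_NS = 8

busy_ns = {
    "instruction memory": 2,
    "register file": 1 + 1,  # one read in ID, one write in WB
    "ALU": 2,
    "data memory": 2,
}

def utilization(busy, cycle=CYCLE_NS):
    """Return the fraction of the cycle during which each unit is in use."""
    return {unit: t / cycle for unit, t in busy.items()}

for unit, frac in utilization(busy_ns).items():
    print(f"{unit:20s} busy {frac:.0%}, idle {1 - frac:.0%}")
```

Under these numbers every unit is busy for only a quarter of the cycle, which is the waste that pipelining is meant to recover.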
[Figure: single-cycle datapath diagrams (repeated)]
Executing, Decoding and Fetching
• Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction.
• But now the instruction memory is free again, so we can fetch the third instruction!
[Figure labels: Fetch 3rd, Decode 2nd, Execute 1st]

Break Datapath into 5 Stages
• Each stage has its own functional units.
• Each stage can execute in 2ns.
IF ID EXE MEM WB
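The overlap described above can be sketched as a small function that reports which stage each instruction occupies in a given clock cycle. This assumes one instruction is fetched per cycle and nothing ever stalls; `stage_of` and `snapshot` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def stage_of(instr, cycle):
    """Stage occupied by instruction `instr` (0-based) in `cycle` (1-based).

    Returns None if the instruction has not started yet or has already
    finished. Assumes one fetch per cycle and no stalls.
    """
    idx = cycle - 1 - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

def snapshot(cycle, n_instr):
    """Map each in-flight instruction to its stage for the given cycle."""
    result = {}
    for i in range(n_instr):
        stage = stage_of(i, cycle)
        if stage is not None:
            result[i] = stage
    return result

# Cycle 3: the 1st instruction executes, the 2nd decodes, the 3rd is fetched.
print(snapshot(3, 3))  # {0: 'EXE', 1: 'ID', 2: 'IF'}
```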
[Figure: datapath diagrams divided into the five stages IF, ID, EX, MEM, WB]
A solution: Insert NOP stages
• Enforce uniformity
— Make all instructions take 5 cycles.
— Make them have the same stages, in the same order.
• Some stages will do nothing for some instructions.
R-type   IF   ID   EX   NOP  WB

Clock cycle          1    2    3    4    5    6    7    8    9
add $sp, $sp, -4     IF   ID   EX   NOP  WB
sub $v0, $a0, $a1         IF   ID   EX   NOP  WB
lw $t0, 4($sp)                 IF   ID   EX   MEM  WB
or $s0, $s1, $s2                    IF   ID   EX   NOP  WB
lw $t1, 8($sp)                           IF   ID   EX   MEM  WB

• Stores and Branches have NOP stages, too…
store    IF   ID   EX   MEM  NOP
branch   IF   ID   EX   NOP  NOP

Pipeline Registers
• We’ll add intermediate registers to our pipelined datapath too.
• There’s a lot of information to save, however. We’ll simplify our diagrams by drawing just one big pipeline register between each stage.
• The registers are named for the stages they connect.
IF/ID   ID/EX   EX/MEM   MEM/WB
• No register is needed after the WB stage, because after WB the instruction is done.
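As a rough illustration of the two ideas above, the sketch below pads each instruction class to the uniform five-stage sequence and derives the pipeline register names from adjacent stage pairs. The `NEEDED` table and the `padded` helper are illustrative; the per-class stage sets follow the R-type, store, and branch rows shown above, with the load row taken from the `lw` lines of the diagram.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Stages in which each instruction class does real work (illustrative table).
NEEDED = {
    "r-type": {"IF", "ID", "EX", "WB"},
    "load":   {"IF", "ID", "EX", "MEM", "WB"},
    "store":  {"IF", "ID", "EX", "MEM"},
    "branch": {"IF", "ID", "EX"},
}

def padded(kind):
    """Return the uniform 5-stage sequence, with NOP where nothing happens."""
    return [s if s in NEEDED[kind] else "NOP" for s in STAGES]

# Each pipeline register sits between two adjacent stages and is named for them.
pipeline_registers = [f"{a}/{b}" for a, b in zip(STAGES, STAGES[1:])]

for kind in NEEDED:
    print(f"{kind:7s}", " ".join(padded(kind)))
print(pipeline_registers)
```

Note that five stages give four pipeline registers, since nothing needs to be saved after WB.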
That’s a lot of Diagrams There
Clock cycle          1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)       IF   ID   EX   MEM  WB
sub $v0, $a0, $a1         IF   ID   EX   MEM  WB
and $t1, $t2, $t3              IF   ID   EX   MEM  WB
or $s0, $s1, $s2                    IF   ID   EX   MEM  WB
add $t5, $t6, $0                         IF   ID   EX   MEM  WB
• Compare the last nine slides with the pipeline diagram above.
— You can see how instruction executions are overlapped.
— Each functional unit is used by a different instruction in each cycle.
— The pipeline registers save control and data values generated in previous clock cycles for later use.
— When the pipeline is full in clock cycle 5, all of the hardware units are utilized. This is the ideal situation, and what makes pipelined processors so fast.

Performance Revisited
• Assuming the following functional unit latencies:
Inst mem   Reg Read   ALU   Data Mem   Reg Write
3ns        2ns        2ns   3ns        2ns
• What is the cycle time of a single-cycle implementation?
— What is its throughput (how many instructions finish per unit of time)?
• What is the cycle time of an ideal pipelined implementation?
— What is its steady-state throughput?
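One way to work through these questions is sketched below. It assumes the single-cycle clock must be long enough for all five units in sequence, that the ideal pipelined clock is set by the slowest stage, and that pipeline-register overhead is zero. The variable names and the `total_time_ns` helper are illustrative.

```python
# Functional unit latencies (ns) from the slide above.
latency_ns = {
    "Inst mem": 3,
    "Reg Read": 2,
    "ALU": 2,
    "Data Mem": 3,
    "Reg Write": 2,
}

# Single-cycle: one instruction passes through every unit within one clock.
single_cycle_ns = sum(latency_ns.values())

# Ideal pipeline: the clock is limited by the slowest stage (assumption:
# no pipeline-register overhead and no stalls).
pipelined_cycle_ns = max(latency_ns.values())

single_throughput = 1 / single_cycle_ns        # instructions per ns
pipelined_throughput = 1 / pipelined_cycle_ns  # steady state, instructions per ns
speedup = single_cycle_ns / pipelined_cycle_ns

def total_time_ns(n_instr, n_stages=len(latency_ns), cycle_ns=pipelined_cycle_ns):
    """Time to finish n_instr instructions, including filling the pipeline."""
    return (n_stages - 1 + n_instr) * cycle_ns

print(f"single-cycle: {single_cycle_ns} ns/cycle, {single_throughput:.3f} instr/ns")
print(f"pipelined:    {pipelined_cycle_ns} ns/cycle, {pipelined_throughput:.3f} instr/ns")
print(f"steady-state speedup: {speedup:.1f}x")
```

Under these assumptions the speedup is 4x rather than 5x, because the stages are not balanced: the 2ns stages wait on the 3ns memory stages every cycle.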