CH 7
Pipeline
[Figure: datapath showing the feedback path, with two adders (one adding 4), shift left 2, the ALU result, and a 0/1 MUX]
Pipelining load
[Figure: timing diagram of pipelined loads against the clock, Cycle 1 through Cycle 7, drawn over the datapath]
ID Stage of load
CODE: lw $t1, 100($t2)
ID/EXE(PC+4) <= IF/ID(PC+4)
ID/EXE($t1) <= $t1
ID/EXE($t2) <= $t2
ID/EXE(signext) <= signext(100)
[Figure: datapath with lw in the instruction decode stage]
EX Stage of load
CODE: lw $t1, 100($t2)
EXE/MEM(address) <= ID/EXE($t2) + ID/EXE(signext(100))
[Figure: datapath with lw in the execution stage]
MEM Stage of load
CODE: lw $t1, 100($t2)
MEM/WB(Data) <= Data[EXE/MEM(address)]
[Figure: datapath with lw in the memory stage]
WB Stage of load
CODE: lw $t1, 100($t2)
Register($t1) <= MEM/WB(Data)
Who will supply this address?
[Figure: datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and lw in the write-back stage]
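The register transfers listed for the ID, EX, MEM, and WB stages of lw can be traced with a small sketch. This is only an illustration under stated assumptions: registers and memory are plain dictionaries, the function and field names are hypothetical, and the destination register number is assumed to travel along the pipeline registers with the instruction, which is one common answer to "Who will supply this address?".

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_lw(regs, data_memory, pc, rt, rs, offset):
    """Trace lw rt, offset(rs) through the pipeline registers, stage by stage."""
    # IF: the incremented PC is saved in IF/ID.
    if_id = {"pc_plus_4": pc + 4}
    # ID: read both registers and sign-extend the 16-bit offset.
    id_ex = {
        "pc_plus_4": if_id["pc_plus_4"],
        "rs_value": regs[rs],
        "rt_value": regs[rt],
        "signext": sign_extend16(offset),
        "write_reg": rt,  # assumption: destination number is carried along
    }
    # EX: the ALU adds the base register and the sign-extended offset.
    ex_mem = {
        "address": id_ex["rs_value"] + id_ex["signext"],
        "write_reg": id_ex["write_reg"],
    }
    # MEM: read data memory at the computed address.
    mem_wb = {
        "data": data_memory[ex_mem["address"]],
        "write_reg": ex_mem["write_reg"],
    }
    # WB: write the loaded value into the destination register.
    regs[mem_wb["write_reg"]] = mem_wb["data"]
    return ex_mem["address"], mem_wb["data"]


regs = {"$t1": 0, "$t2": 1000}          # hypothetical register contents
data_memory = {1100: 42}                # hypothetical memory contents
address, loaded = run_lw(regs, data_memory, pc=0, rt="$t1", rs="$t2", offset=100)
print(address, loaded, regs["$t1"])
```

Carrying write_reg through ID/EX, EX/MEM, and MEM/WB means the register number used in WB belongs to the load itself rather than to whatever instruction is currently in ID.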
The Four Stages of R-type
Pipelining R-type and load
[Figure: timing diagram of overlapped R-type and load instructions against the clock, Cycle 1 through Cycle 9]
Important Observation
Stage     1        2         3      4     5
Load      Ifetch   Reg/Dec   Exec   Mem   Wr
R-type    Ifetch   Reg/Dec   Exec   Wr
Solution: Delay R-type’s Write
Delay R-type's register write by one cycle:
R-type then also uses the register file's write port at stage 5
MEM is a NOP stage for R-type: nothing is done.
Stage     1        2         3      4     5
R-type    Ifetch   Reg/Dec   Exec   Mem   Wr
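The write-port conflict behind this fix can be checked with a short sketch. It assumes one instruction is issued per cycle starting in cycle 1; the function and dictionary names are hypothetical and chosen only for illustration.

```python
# Stage (1-based) in which each instruction class writes the register file.
WRITE_STAGE_ORIGINAL = {"load": 5, "r-type": 4}
WRITE_STAGE_DELAYED = {"load": 5, "r-type": 5}   # R-type gets a NOP MEM stage


def write_port_conflicts(program, write_stage):
    """Return {cycle: [instruction indices]} for cycles with more than one writer.

    program lists instruction classes in issue order, one issued per cycle.
    """
    writers = {}
    for index, kind in enumerate(program):
        issue_cycle = index + 1
        write_cycle = issue_cycle + write_stage[kind] - 1
        writers.setdefault(write_cycle, []).append(index)
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}


program = ["load", "r-type", "r-type"]
print(write_port_conflicts(program, WRITE_STAGE_ORIGINAL))  # load and first R-type collide
print(write_port_conflicts(program, WRITE_STAGE_DELAYED))   # no collisions
```

With the four-stage R-type, a load followed by an R-type both reach the write port in cycle 5; once every instruction writes in stage 5, each cycle has at most one writer.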
The Four Stages of store
[Figure: store timing, Cycle 1 through Cycle 4]
The Three Stages of beq
[Figure: beq timing, Cycle 1 through Cycle 4]
WB: NOP
Pipelined Datapath
[Figure: pipelined datapath with instruction memory, register file, sign extend, ALU, data memory, the PC + 4 adder, and the branch adder with shift left 2]
Graphically Representing Pipelines
Time (in clock cycles):       CC 1   CC 2   CC 3   CC 4   CC 5   CC 6
Program execution order (in instructions):
lw $10, 20($1)                IM     Reg    ALU    DM     Reg
[Figure: pipelined datapath]
Example 1: Cycle 2
sub $11, $2, $3: Instruction fetch
lw $10, 20($1): Instruction decode
[Figure: pipelined datapath in cycle 2]
Example 1: Cycle 3
sub $11, $2, $3: Instruction decode
lw $10, 20($1): Execution
[Figure: pipelined datapath in cycle 3]
Example 1: Cycle 4
sub $11, $2, $3: Execution
lw $10, 20($1): Memory
[Figure: pipelined datapath in cycle 4]
Example 1: Cycle 5
sub $11, $2, $3: Memory
lw $10, 20($1): Write Back
[Figure: pipelined datapath in cycle 5]
Example 1: Cycle 6
sub $11, $2, $3: Write Back
[Figure: pipelined datapath in cycle 6]
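The cycle-by-cycle walk through Example 1 follows a simple rule: an instruction issued in cycle c occupies stage number (cycle - c). A minimal sketch, assuming lw is fetched in cycle 1 and sub in cycle 2 (the helper name is hypothetical):

```python
STAGES = ["Instruction fetch", "Instruction decode", "Execution", "Memory", "Write Back"]


def stage_in_cycle(issue_cycle, cycle):
    """Return the stage an instruction occupies in the given cycle, or None."""
    index = cycle - issue_cycle
    return STAGES[index] if 0 <= index < len(STAGES) else None


# (instruction text, cycle in which it is fetched)
program = [("lw $10, 20($1)", 1), ("sub $11, $2, $3", 2)]

for cycle in range(1, 7):
    active = {text: stage_in_cycle(start, cycle) for text, start in program}
    print(cycle, active)
```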
Pipeline Control: Control Signals
[Figure: pipelined datapath annotated with the control signals PCSrc, Branch, RegWrite, MemWrite, ALUOp, and RegDst; the RegDst MUX selects between Instruction [20-16] and Instruction [15-11]]
Effect of Seven 1-bit Control Signals
MemRead
  Deasserted: None
  Asserted: Data memory contents at the read address are put on the read data output
MemWrite
  Deasserted: None
  Asserted: Data memory contents at the address given by write address are replaced by the value on the write data input
ALUSrc
  Deasserted: The second ALU operand comes from the second register file output
  Asserted: The second ALU operand is the sign-extended lower 16 bits of the instruction
RegDst
  Deasserted: The register destination number for the Write register comes from the rt field
  Asserted: The register destination number for the Write register comes from the rd field
RegWrite
  Deasserted: None
  Asserted: The register on the Write register input is written with the value on the write data input
PCSrc
  Deasserted: The PC is replaced by the output of the adder that computes the value of PC + 4
  Asserted: The PC is replaced by the output of the adder that computes the branch target
MemtoReg
  Deasserted: The value fed to the register write data input comes from the ALU
  Asserted: The value fed to the register write data input comes from the data memory
The function of each of the seven control signals. When the 1-bit control to a two-
way multiplexor is asserted, the multiplexor selects the input corresponding to 1.
Otherwise, if the control is deasserted, the multiplexor selects the 0 input.
Remember that the state elements all have the clock as an implicit input and that
the clock is used in controlling writes.
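The multiplexor rule in the caption (asserted selects input 1, deasserted selects input 0) can be written down directly. The sketch below is illustrative only; the function names are hypothetical, while the signal meanings follow the table.

```python
def mux(select, input0, input1):
    """Two-way multiplexor: asserted (1) selects input1, deasserted (0) selects input0."""
    return input1 if select else input0


def second_alu_operand(alu_src, read_data_2, sign_extended_imm):
    # ALUSrc = 0: second register file output; ALUSrc = 1: sign-extended immediate.
    return mux(alu_src, read_data_2, sign_extended_imm)


def write_register_number(reg_dst, rt, rd):
    # RegDst = 0: rt field; RegDst = 1: rd field.
    return mux(reg_dst, rt, rd)


def register_write_data(mem_to_reg, alu_result, memory_data):
    # MemtoReg = 0: ALU result; MemtoReg = 1: data memory output.
    return mux(mem_to_reg, alu_result, memory_data)


def next_pc(pc_src, pc_plus_4, branch_target):
    # PCSrc = 0: PC + 4; PCSrc = 1: branch target.
    return mux(pc_src, pc_plus_4, branch_target)


# For a load: ALUSrc = 1, RegDst = 0, MemtoReg = 1.
print(second_alu_operand(1, 7, 100), write_register_number(0, 9, 11), register_write_data(1, 1100, 42))
```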
Group Signals According to Stages
Original control for single clock cycle implementation
Data Stationary Control
Pass control signals along just like the data
Main control generates control signals during ID
[Figure (Fig. 6.26): the control unit generates the EX, M, and WB control fields from the instruction; ID/EX holds EX, M, and WB; EX/MEM holds M and WB; MEM/WB holds WB]
Data Stationary Control (cont.)
Signals for EX (ExtOp, ALUSrc, ...) are used 1 cycle later
Signals for MEM (MemWr, Branch) are used 2 cycles later
Signals for WB (MemtoReg, RegWr) are used 3 cycles later
[Figure: control signals such as ExtOp, ALUSrc, and ALUOp generated in ID and carried through the IF/ID, ID/Ex, Ex/MEM, and MEM/WB registers across the ID, EX, MEM, and WB stages]
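Data stationary control can be sketched as a bundle of control fields that shrinks as it moves down the pipeline. The control values below follow the standard MIPS control table for lw and R-type instructions; the dictionary layout and the function name are assumptions made for this illustration.

```python
# Control fields grouped by the stage that uses them (values per the usual
# MIPS control table; only two instruction classes are shown).
CONTROL = {
    "lw": {
        "EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "r-type": {
        "EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
}


def trace_control(opcode):
    """Return the control fields held by ID/EX, EX/MEM, and MEM/WB for one instruction."""
    fields = CONTROL[opcode]                           # generated during ID
    id_ex = {"EX": fields["EX"], "M": fields["M"], "WB": fields["WB"]}
    ex_mem = {"M": id_ex["M"], "WB": id_ex["WB"]}      # EX field used up in EX
    mem_wb = {"WB": ex_mem["WB"]}                      # M field used up in MEM
    return id_ex, ex_mem, mem_wb


id_ex, ex_mem, mem_wb = trace_control("lw")
print(sorted(id_ex), sorted(ex_mem), sorted(mem_wb))
```

Each pipeline register stores only the fields still needed downstream, so the WB signals generated in ID reach the write-back stage three cycles later together with the data they control.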
WB Stage of load
Register($t1) <= MEM/WB(Data)
Who will supply this address?
[Figure: datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and lw in the write-back stage]
Datapath with Control
[Figure: pipelined datapath with control. The Control unit in ID produces the EX, M, and WB fields, which are carried through ID/EX, EX/MEM, and MEM/WB. Labeled signals: PCSrc, RegWrite, Branch, MemWrite, MemRead, ALUSrc, MemtoReg, ALUOp, RegDst. Instruction [15:0] feeds the sign extend unit and ALU control; Instruction [20:16] and Instruction [15:11] feed the RegDst MUX]
Summary of Pipeline Basics
Pipelining is a fundamental concept
Multiple steps using distinct resources
Utilize the capabilities of the datapath through pipelined instruction processing
Start the next instruction while working on the current one
Limited by the length of the longest stage (plus fill/flush)
Need to detect and resolve hazards
What makes it easy in MIPS?
All instructions are of the same length
Just a few instruction formats
Memory operands only in loads and stores
What makes pipelining hard? Hazards
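The remark that pipelining is limited by the longest stage plus fill/flush can be made concrete with a small timing model. The stage latencies below are hypothetical values chosen for illustration, and the model ignores hazards and stalls.

```python
def unpipelined_time(n_instructions, stage_times):
    """Each instruction runs all stages back to back before the next one starts."""
    return n_instructions * sum(stage_times)


def pipelined_time(n_instructions, stage_times):
    """The clock cycle equals the longest stage; (stages - 1) extra cycles fill the pipe."""
    cycle = max(stage_times)
    return (len(stage_times) + n_instructions - 1) * cycle


stage_times = [200, 100, 200, 200, 100]   # hypothetical latencies in ps: IF, ID, EX, MEM, WB
n = 1000
speedup = unpipelined_time(n, stage_times) / pipelined_time(n, stage_times)
print(unpipelined_time(n, stage_times), pipelined_time(n, stage_times), round(speedup, 2))
```

Under these assumptions the speedup approaches 800 / 200 = 4 rather than the stage count of 5, because the clock must accommodate the slowest stage.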
Hazard Detection
A data hazard exists when one of the source register numbers (in the pipeline
register ID/EX) equals the destination register number in the EX/MEM or
MEM/WB pipeline register
- 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
- 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
‧ Example (type 1a, on register $2)
‧ sub $2, $1, $3
‧ and $12, $2, $5
- 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
- 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
‧ Example (type 2b between sub and or, on register $2)
‧ sub $2, $1, $3
‧ and $12, $2, $5
‧ or $13, $6, $2
- Condition: EX/MEM.RegisterRd != 0, MEM/WB.RegisterRd != 0
(why: see the next page)
Hazard Detection
This policy is inaccurate
Sometimes it would forward when unnecessary
* Some instructions do not write a register
Check whether the RegWrite signal will be active:
Examine the WB control field of the pipeline register during the
EX and MEM stages
Data to be Forwarded
Time (in clock cycles):   CC1   CC2   CC3   CC4   CC5      CC6   CC7   CC8   CC9
Value of register $2:     10    10    10    10    10/-20   -20   -20   -20   -20
Value of EX/MEM:          X     X     X     -20   X        X     X     X     X
Value of MEM/WB:          X     X     X     X     -20      X     X     X     X
Program execution order (in instructions):
sub $2, $1, $3      IM  Reg  DM  Reg
and $12, $2, $5     IM  Reg  DM  Reg
ALU inputs can come from any pipeline register: ID/EXE, EX/MEM, or MEM/WB
(add a MUX to select)
Datapath without Forwarding
[Figure a. No forwarding: registers, ALU, and data memory between the ID/EX, EX/MEM, and MEM/WB pipeline registers]
Datapath with Forwarding
[Figure b. With forwarding: MUXes controlled by ForwardA and ForwardB select each ALU input; the forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd]
Forwarding Control
Three sources for the MUX
ID/EX: no forwarding, just from the register file (00)
EX/MEM: forwarded data from the prior ALU results (10)
MEM/WB: from data memory or an earlier ALU result (01)
Forwarding Control
Forwarding control is in the EX stage, because the ALU
forwarding MUX is in this stage
Pass the operand register numbers from the ID stage via the
ID/EX pipeline register
We already have the rt field (bits 20-16); add rs to the ID/EX pipeline
register
1. EX hazard (select the forwarded data from the EX/MEM stage)
If (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA=10
If (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB=10
Forwarding Control
MEM hazard
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA=01
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB=01
Potential More Complicated Data Hazard
Between the results in the WB stage, the MEM stage, and the ALU source
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
(vector summation)
In the case above, both forwarding conditions hold, but the value in MEM/WB
is the older one and must not be used. Select the forwarded data from the
EX/MEM stage.
Modified control for the MEM hazard (prevents selecting the forwarded data
from the MEM/WB stage when EX/MEM holds a newer result):
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA=01
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB=01
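The EX-hazard and MEM-hazard conditions can be combined into one small function. This is a sketch rather than the slides' exact logic: it checks the EX hazard first so that the newer EX/MEM result takes priority, which has the same intent as the extra EX/MEM condition in the modified MEM-hazard control. The dictionary keys mirror the pipeline-register field names; the function name is hypothetical.

```python
def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB), each "00", "10", or "01"."""

    def select(source_reg):
        # EX hazard: the newest result sits in EX/MEM, so check it first.
        if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
                and ex_mem["RegisterRd"] == source_reg):
            return "10"
        # MEM hazard: use MEM/WB only when EX/MEM did not match.
        if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
                and mem_wb["RegisterRd"] == source_reg):
            return "01"
        return "00"   # no forwarding: operand comes from the register file

    return select(id_ex["RegisterRs"]), select(id_ex["RegisterRt"])


# Third instruction of the vector summation: add $1, $1, $4 is in EX while
# add $1, $1, $3 is in MEM and add $1, $1, $2 is in WB.
id_ex = {"RegisterRs": 1, "RegisterRt": 4}
ex_mem = {"RegWrite": True, "RegisterRd": 1}
mem_wb = {"RegWrite": True, "RegisterRd": 1}
print(forwarding_unit(id_ex, ex_mem, mem_wb))
```

For the summation example both pipeline registers match register $1, and the sketch picks the EX/MEM value for the first ALU operand.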
What if a Data Hazard Cannot Be Solved by Forwarding?
lw can still cause a hazard:
if it is followed by an instruction that reads the loaded register.
[Figure: pipeline diagram for lw $2, 20($1) (IM, Reg, DM, Reg) followed by an instruction that uses $2]
Stall the Pipeline
IF and ID stages
Preserve the register values
The instruction in the IF stage will continue to be read using the
same PC
The registers in the ID stage will continue to be read using the same
instruction fields in the IF/ID pipeline register
Other stages (EX, MEM, WB)
Insert a "NOP" instruction: do nothing
That is: deassert all nine control signals (set them to 0)
No registers or memories are written if the control values are all 0
Stall the Pipeline
Handling Stalls
The hazard detection unit in ID inserts a stall between a load
instruction and its use:
if (ID/EX.MemRead and
((ID/EX.RegisterRt == IF/ID.RegisterRs) or
(ID/EX.RegisterRt == IF/ID.RegisterRt)))
stall the pipeline for one cycle
(ID/EX.MemRead=1 indicates a load instruction)
How to stall?
Stall the instructions in IF and ID: do not change the PC or IF/ID
=> the stages re-execute the same instructions
What to move into EX: insert a NOP by setting the EX, MEM, and WB
control fields of the ID/EX pipeline register to 0
As the control signals propagate, all control signals to EX, MEM, and WB are
deasserted and no registers or memories are written
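The load-use check and the stall action can be sketched as follows. The names pc_write and if_id_write are hypothetical write-enable signals standing for "do not change the PC and IF/ID", and the control bundle is reduced to three fields for brevity.

```python
NOP_CONTROL = {"EX": 0, "M": 0, "WB": 0}   # all control fields deasserted


def must_stall(id_ex, if_id):
    """True when a load in EX produces a register that the instruction in ID reads."""
    return bool(id_ex["MemRead"]
                and id_ex["RegisterRt"] in (if_id["RegisterRs"], if_id["RegisterRt"]))


def hazard_detection_unit(id_ex, if_id, decoded_control):
    """Return (pc_write, if_id_write, control fields to latch into ID/EX)."""
    if must_stall(id_ex, if_id):
        # Freeze PC and IF/ID so IF and ID repeat, and send a NOP into EX.
        return False, False, dict(NOP_CONTROL)
    return True, True, decoded_control


# lw $2, 20($1) is in EX; and $4, $2, $5 is in ID.
id_ex = {"MemRead": True, "RegisterRt": 2}
if_id = {"RegisterRs": 2, "RegisterRt": 5}
print(hazard_detection_unit(id_ex, if_id, {"EX": 1, "M": 0, "WB": 1}))
```

After the one-cycle bubble, the loaded value is available in MEM/WB and ordinary forwarding can supply it to the dependent instruction.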
Branch Hazards
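By the time the branch decision is made, other instructions are already in the pipeline!
[Figure: pipeline diagram in program execution order; the instruction at address 72 is lw $4, 50($7) (IM, Reg, DM, Reg)]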
Solving Control Hazards
Reducing the delay of branches
Concept
Move the branch execution earlier in the pipeline, so that fewer
instructions need to be flushed.
What did the MIPS designers do?
Make the common case fast
Many branches rely only on simple tests (equality or sign)
Such tests can be done with a few gates, without the full ALU
Solving Control Hazards
Move the branch execution from the MEM stage to the ID stage
2 steps in the ID stage
1. Compute the branch target (PC + offset)
Move the branch adder from the EX stage to the ID stage
2. Evaluate the branch decision
Equality test of two registers
XOR their respective bits and OR all the results
Flush the instruction in the IF stage
New control line: IF.Flush
Zero the instruction field of the IF/ID pipeline register (=> NOP)
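The two ID-stage steps for beq can be sketched as below. The equality test follows the XOR-then-OR recipe; the target calculation assumes the usual MIPS convention that the sign-extended word offset is shifted left by 2 and added to PC + 4. Function names and the returned if_flush flag are hypothetical.

```python
def registers_equal(a, b, width=32):
    """Equality test: XOR the operands bitwise, then OR all the result bits."""
    diff = (a ^ b) & ((1 << width) - 1)
    any_bit_set = 0
    for bit in range(width):
        any_bit_set |= (diff >> bit) & 1
    return any_bit_set == 0


def branch_in_id(pc_plus_4, rs_value, rt_value, sign_extended_offset):
    """Return (next_pc, if_flush) for a beq resolved in the ID stage."""
    target = pc_plus_4 + (sign_extended_offset << 2)   # branch adder moved into ID
    if registers_equal(rs_value, rt_value):
        return target, True     # taken: flush the instruction just fetched
    return pc_plus_4, False     # not taken: keep fetching sequentially


# Hypothetical beq at address 40 with word offset 7 and equal operands.
print(branch_in_id(44, 5, 5, 7))
```

When the branch is taken, only the single instruction in IF has to be turned into a NOP, compared with three instructions when the decision is made in MEM.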
Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction)
Branch prediction buffer or branch history table
A small memory indexed by the lower portion of the address of the
branch instruction. The memory contains a bit that says whether
the branch was recently taken or not.
No address check
Simplest one: 1-bit prediction
Basic Branch Prediction Buffers
Branch History Table (BHT) - Small direct-mapped cache of T/NT bits
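[Figure: the PC of the branch instruction indexes the BHT; on T (predict taken) the next PC is the branch target computed by an adder from the PC and the instruction in IR, otherwise it is PC + 4]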
Branch Prediction Buffer
A branch prediction buffer can be implemented as a small
cache accessed during the IF stage.
Dynamic Branch Prediction
2-bit Prediction Scheme
A prediction must be wrong twice before it is changed
Suitable for branches that strongly favor taken or not taken
Such a branch is mispredicted only once
Advanced ones
Correlating predictors
Global and local branch history
Tournament predictor
Multiple predictions for each branch
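The difference between 1-bit and 2-bit prediction shows up on a loop branch that is taken many times and then falls through once. The sketch below models a single branch; the 2-bit scheme is implemented as a saturating counter, which is one common realization and an assumption here rather than the exact state machine of the slides. Function names are hypothetical.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """1-bit scheme: predict whatever the branch did last time."""
    prediction, misses = initial, 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken
    return misses


def two_bit_mispredictions(outcomes, counter=3):
    """2-bit saturating counter: values 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# A loop branch: taken 9 times, then not taken once; the loop is run 3 times.
outcomes = ([True] * 9 + [False]) * 3
print(one_bit_mispredictions(outcomes), two_bit_mispredictions(outcomes))
```

In this trace the 1-bit predictor misses at every loop exit and again on re-entry, while the 2-bit counter, starting in the strongly-taken state, misses only once per run of the loop.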
Scheduling the Branch Delay Slot
Three ways to fill the branch delay slot: a. From before, b. From target, c. From fall through
[Figure: code sequences for each strategy, using add $s1, $s2, $s3 (a and c) and sub $t4, $t5, $t6 (b)]
[Figure: pipelined datapath with the hazard detection unit, the forwarding unit, and the Cause and EPC registers]
Instruction Level Parallelism
Reference for more details
Computer Architecture: A Quantitative Approach
Two methods to increase ILP
Increase the pipeline depth
More operations being overlapped
Pipeline speedup ∝ pipeline depth
8 or more pipeline stages
To get the speedup, rebalance the remaining steps
Performance is potentially greater because of the shorter clock cycle
Multiple issue
Issue 3 to 8 instructions in every clock cycle
Static multiple issue: determined at compile time
Dynamic multiple issue: determined during execution
Two problems
How to package instructions into issue slots (by compiler or hardware)
Dealing with data and control hazards (by compiler or hardware)
Speculation: Find and Exploit more ILP
Speculation
An approach that allows the compiler or the processor to "guess"
the outcome of an instruction to remove it as a dependence in
executing other instructions
E.g. branch, store before load
How it works
The compiler or processor uses speculation to
reorder instructions,
move an instruction across a branch, or
move a load across a store
Mechanism
A method to check whether the guess was right and a method to back out
the effects
Difficulty: what if the guess is wrong (back-out capability)
Speculation: Find and Exploit more ILP
Hardware approach
Buffer the speculative results until no longer speculative
If correct, complete the instruction (write results to registers)
If incorrect, flush the buffer and re-execute the correct one
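The hardware approach, buffering speculative results until the guess is resolved, can be sketched with a tiny buffer class. This is a simplified illustration under stated assumptions: only register writes are buffered, and the class and method names are hypothetical.

```python
class SpeculativeBuffer:
    """Holds speculative register writes until the guess is resolved."""

    def __init__(self, registers):
        self.registers = registers
        self.pending = []                 # buffered (register, value) results

    def write(self, register, value):
        # Speculative results never touch the register file directly.
        self.pending.append((register, value))

    def resolve(self, guess_was_correct):
        if guess_was_correct:
            # No longer speculative: complete the instructions.
            for register, value in self.pending:
                self.registers[register] = value
        # Correct or not, the buffer is emptied; on a wrong guess the
        # correct instructions must then be executed instead.
        self.pending.clear()


regs = {"$t0": 1}
buffer = SpeculativeBuffer(regs)
buffer.write("$t0", 99)
buffer.resolve(guess_was_correct=False)   # flush: $t0 keeps its old value
buffer.write("$t0", 7)
buffer.resolve(guess_was_correct=True)    # commit: $t0 becomes 7
print(regs)
```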
Speculation: Find and Exploit more ILP
Static Multiple Issue
The compiler assists with packaging instructions and handling data
hazards
Issue packet
Treated as one large instruction with multiple operations
VLIW: very long instruction word
EPIC: Explicitly Parallel Instruction Computer (IA-64)
Variation: how the compiler handles hazards
1. The compiler handles all hazards, schedules code, and inserts no-ops
2. The compiler handles all dependences within an instruction, and
the hardware detects data hazards and generates stalls between
two issue packets
Two-Issue MIPS Processor
Static two-issue pipeline (64-bit IF and ID)
Multiple Issue Code Scheduling
Original: add scalar $s2 to array
Loop:
lw $t0, 0($s1)          # $t0 = array element
addu $t0, $t0, $s2      # add scalar in $s2
sw $t0, 0($s1)          # store result
addi $s1, $s1, -4       # decrement pointer
bne $s1, $zero, Loop    # branch if $s1 != 0
Loop Unrolling for 2-Issue MIPS
To get more performance from loops: loop unrolling
Assume the loop index is a multiple of four
Unroll the loop four times; use register renaming to remove antidependences
        ALU or branch slot        Data transfer slot
Loop:   addi $s1, $s1, -16        lw $t0, 0($s1)
                                  lw $t1, 12($s1)
        addu $t0, $t0, $s2        lw $t2, 8($s1)
        addu $t1, $t1, $s2        lw $t3, 4($s1)
        addu $t2, $t2, $s2        sw $t0, 16($s1)
        addu $t3, $t3, $s2        sw $t1, 12($s1)
                                  sw $t2, 8($s1)
        bne $s1, $zero, Loop      sw $t3, 4($s1)
Original loop in C:
for (i = 0; i < 16; i++){
array[i] = array[i] + scalar;
}
Unrolled loop in C:
for (i = 0; i < 16; i += 4){
array[i] = array[i] + scalar;
array[i+1] = array[i+1] + scalar;
array[i+2] = array[i+2] + scalar;
array[i+3] = array[i+3] + scalar;
}
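A quick check that unrolling preserves the result, written as a sketch with hypothetical function names. The four temporaries stand in for the renamed registers $t0 to $t3, and the array length is assumed to be a multiple of four, as in the slide.

```python
def add_scalar(array, scalar):
    """Original loop: one element per iteration."""
    result = list(array)
    for i in range(len(result)):
        result[i] = result[i] + scalar
    return result


def add_scalar_unrolled(array, scalar):
    """Loop unrolled four times; assumes len(array) is a multiple of four."""
    if len(array) % 4 != 0:
        raise ValueError("length must be a multiple of four")
    result = list(array)
    for i in range(0, len(result), 4):
        # Four independent temporaries play the role of $t0..$t3.
        t0 = result[i] + scalar
        t1 = result[i + 1] + scalar
        t2 = result[i + 2] + scalar
        t3 = result[i + 3] + scalar
        result[i], result[i + 1], result[i + 2], result[i + 3] = t0, t1, t2, t3
    return result


data = list(range(16))
print(add_scalar(data, 5) == add_scalar_unrolled(data, 5))

# The two-issue schedule packs 14 instructions into 8 issue cycles.
print(14 / 8)
```

Because the four updates in one unrolled iteration are independent of each other, a scheduler is free to interleave their loads, adds, and stores across the two issue slots.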
Dynamic Multiple Issue Processors
Superscalar
Instructions issue in order
0, 1, or more instructions can issue in a given clock cycle
To achieve good performance
Needs the compiler to schedule instructions
More important: the hardware guarantees that instructions are executed
correctly whether scheduled or not
Extension: dynamic pipeline scheduling
Hardware support to reorder the execution order to avoid stalls
Dynamic Pipeline Scheduling
[Figure: the instruction fetch and decode unit issues instructions in order; the commit unit commits results in order]
Summary
Performance is specific to a particular program or set of programs
Total execution time is a consistent summary of performance
For a given architecture, performance increases come from:
increases in clock rate (without adverse CPI effects)
improvements in processor organization that lower CPI
compiler enhancements that lower CPI and/or instruction count
Algorithm/language choices that affect instruction count
Amdahl's law