
Computer Organization

Pipeline

Prof. Ya-Shu Chen


National Taiwan University of Science and Technology
Split Single-cycle Datapath
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, MUXes) divided into the five stages above]

What to add to split the datapath into stages?


Add Pipeline Registers
[Figure: the same datapath with pipeline registers (latches) IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]

 Use registers between stages to carry data and control


Consider load
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load    Ifetch   Reg/Dec  Exec     Mem      Wr

 IF: Instruction Fetch
 Fetch the instruction from the Instruction Memory
 ID: Instruction Decode
 Register fetch and instruction decode
 EX: Calculate the memory address
 MEM: Read the data from the Data Memory
 WB: Write the data back to the register file

Pipelining load
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw   Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw            Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                     Ifetch   Reg/Dec  Exec     Mem      Wr

 The 5 functional units in the pipelined datapath are:
 Instruction Memory for the Ifetch stage
 Register File’s Read ports (busA and busB) for the Reg/Dec stage
 ALU for the Exec stage
 Data Memory for the MEM stage
 Register File’s Write port (busW) for the WB stage

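The overlap in the timing diagram above can be reproduced with a few lines of Python. This is only a sketch under the assumption that one instruction enters the pipeline per cycle and nothing stalls; the names `pipeline_diagram` and `total_cycles` are made up for this illustration.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def pipeline_diagram(names, stages=STAGES):
    """Return {name: {cycle: stage}}, issuing one instruction per cycle (no stalls assumed)."""
    diagram = {}
    for i, name in enumerate(names):
        # Instruction i enters the first stage in cycle i + 1 and advances one stage per cycle.
        diagram[name] = {i + 1 + s: stage for s, stage in enumerate(stages)}
    return diagram

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """The first instruction needs n_stages cycles; each later one finishes one cycle after it."""
    return n_stages + n_instructions - 1

if __name__ == "__main__":
    rows = pipeline_diagram(["1st lw", "2nd lw", "3rd lw"])
    for name, row in rows.items():
        print(f"{name:8}", [row.get(c, "") for c in range(1, total_cycles(3) + 1)])
```

With three loads the last `Wr` lands in cycle 7, matching the diagram.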
IF Stage of load
CODE: lw $t1, 100($t2)

IF/ID(IR) <= MEM[PC]
IF/ID(PC+4) <= PC + 4
PC = PC + 4

[Figure: pipelined datapath with lw in the IF (instruction fetch) stage highlighted]

ID Stage of load
CODE: lw $t1, 100($t2)

ID/EX(PC+4) <= IF/ID(PC+4)
ID/EX($t1) <= $t1
ID/EX($t2) <= $t2
ID/EX(signext) <= signext(100)

[Figure: pipelined datapath with lw in the ID (instruction decode) stage highlighted]

EX Stage of load
CODE: lw $t1, 100($t2)

EX/MEM(address) <= ID/EX($t2) + ID/EX(signext(100))

[Figure: pipelined datapath with lw in the EX (execution) stage highlighted]

MEM Stage of load
CODE: lw $t1, 100($t2)

MEM/WB(Data) <= Data[EX/MEM(address)]

[Figure: pipelined datapath with lw in the MEM (memory access) stage highlighted]

WB Stage of load
CODE: lw $t1, 100($t2)

Register($t1) <= MEM/WB(Data)

Who will supply this address? (callout on the register file's Write register input)

[Figure: pipelined datapath with lw in the WB (write back) stage highlighted]

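The five register transfers for `lw $t1, 100($t2)` can be walked through in Python. This is a rough sketch, not the slides' datapath: the pipeline registers are modelled as dictionaries with field names chosen for this example, and the destination register number is carried along in every pipeline register, which is one common answer to "who will supply this address?".

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def run_lw(pc, regs, data_mem, rt, rs, offset):
    """Walk lw rt, offset(rs) through IF, ID, EX, MEM, WB, one stage after another."""
    # IF: latch PC + 4 and the instruction fields (a stand-in for the fetched IR).
    if_id = {"pc4": pc + 4, "rs": rs, "rt": rt, "imm": offset}
    # ID: read the base register and sign-extend the immediate.
    id_ex = {"pc4": if_id["pc4"], "rs_val": regs[if_id["rs"]],
             "imm": sign_extend16(if_id["imm"]), "rt": if_id["rt"]}
    # EX: the ALU computes the memory address.
    ex_mem = {"address": id_ex["rs_val"] + id_ex["imm"], "rt": id_ex["rt"]}
    # MEM: read the data memory.
    mem_wb = {"data": data_mem[ex_mem["address"]], "rt": ex_mem["rt"]}
    # WB: the write register number comes from MEM/WB, not from a younger instruction.
    regs[mem_wb["rt"]] = mem_wb["data"]
    return regs

if __name__ == "__main__":
    registers = {"$t1": 0, "$t2": 400}
    memory = {500: 42}
    print(run_lw(0, registers, memory, "$t1", "$t2", 100))
```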
The Four Stages of R-type
          Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type    Ifetch   Reg/Dec  Exec     Wr

 IF: fetch the instruction from the Instruction Memory
 ID: register fetch and instruction decode
 EX: ALU operates on the two register operands
 WB: write ALU output back to the register file

Pipelining R-type and load
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Wr
R-type            Ifetch   Reg/Dec  Exec     Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Wr
R-type                                       Ifetch   Reg/Dec  Exec     Wr

Oops! We have a problem: in cycle 7 the load and the R-type that follows it are both in Wr.

 We have a structural hazard:
 Two instructions try to write to the register file at the same time!
 Only one write port

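The write-port conflict can be found mechanically by computing the cycle in which each instruction reaches `Wr`. The sketch below assumes one instruction is fetched per cycle; the function name and the stage lists are written for this example only.

```python
from collections import defaultdict

FOUR_STAGE_R = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}
# R-type padded with a do-nothing Mem stage, so every instruction writes in stage 5.
FIVE_STAGE_R = dict(FOUR_STAGE_R, **{"R-type": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]})

def write_port_conflicts(program, stage_lists):
    """Return {cycle: [instruction indexes]} for cycles with more than one instruction in Wr."""
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        # Instruction i starts in cycle i + 1; Wr is reached index("Wr") cycles later.
        writers[i + 1 + stage_lists[kind].index("Wr")].append(i)
    return {cycle: idx for cycle, idx in writers.items() if len(idx) > 1}

if __name__ == "__main__":
    program = ["R-type", "R-type", "Load", "R-type", "R-type"]
    print(write_port_conflicts(program, FOUR_STAGE_R))
    print(write_port_conflicts(program, FIVE_STAGE_R))
```

For the five-instruction sequence above, the four-stage R-type collides with the load in cycle 7, and the padded five-stage R-type produces no conflict.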
Important Observation

 Each functional unit can only be used once per instruction
 Each functional unit must be used at the same stage for all instructions:

         1        2        3      4     5
Load     Ifetch   Reg/Dec  Exec   Mem   Wr
R-type   Ifetch   Reg/Dec  Exec   Wr

Solution: Delay R-type’s Write
 Delay R-type’s register write by one cycle:
 R-type also uses the Reg File’s write port at Stage 5
 MEM is a NOP stage: nothing is being done.

         1        2        3      4     5
R-type   Ifetch   Reg/Dec  Exec   Mem   Wr

R-type also has 5 stages:

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Mem      Wr
R-type            Ifetch   Reg/Dec  Exec     Mem      Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                       Ifetch   Reg/Dec  Exec     Mem      Wr

The Four Stages of store
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Store    Ifetch   Reg/Dec  Exec     Mem      Wr

 IF: fetch the instruction from the Instruction Memory
 ID: register fetch and instruction decode
 EX: calculate the memory address
 MEM: write the data into the Data Memory

Add an extra stage:
 WB: NOP

The Three Stages of beq
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Beq      Ifetch   Reg/Dec  Exec     Mem      Wr

 IF: fetch the instruction from the Instruction Memory
 ID: register fetch and instruction decode
 EX:
 compare the two register operands
 select the correct branch target address
 latch it into the PC
Add two extra stages:
 MEM: NOP
 WB: NOP

Pipelined Datapath
[Figure: the complete pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]

Graphically Representing Pipelines
Time (in clock cycles) runs left to right; program execution order (in instructions) runs top to bottom.

                   CC 1  CC 2  CC 3  CC 4  CC 5  CC 6
lw  $10, 20($1)    IM    Reg   ALU   DM    Reg
sub $11, $2, $3          IM    Reg   ALU   DM    Reg

 Can help with answering questions like:
 How many cycles does it take to execute this code?
 What is the ALU doing during cycle 4?
 Helps in understanding datapaths

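Both questions can be answered from the diagram with simple index arithmetic. The helper names below are invented for this sketch, and it assumes the ideal case of one instruction issued per cycle with no stalls.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # the last "Reg" is the register write back

def cycles_needed(n_instructions, n_stages=len(STAGES)):
    """Cycles to run n instructions through an n_stages pipeline without stalls."""
    return n_stages + n_instructions - 1

def who_uses(stage_position, cycle, program):
    """Instruction at stage_position (0 = IM ... 4 = write back) in a 1-based cycle, or None."""
    index = cycle - 1 - stage_position
    return program[index] if 0 <= index < len(program) else None

if __name__ == "__main__":
    program = ["lw $10, 20($1)", "sub $11, $2, $3"]
    print("cycles:", cycles_needed(len(program)))
    print("ALU in cycle 4:", who_uses(2, 4, program))
```

For the two-instruction example this gives 6 cycles, and in cycle 4 the ALU is working on the `sub`.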
Example 1: Cycle 1
lw $10, 20($1): Instruction fetch

[Figure: pipelined datapath in clock cycle 1]

Example 1: Cycle 2
sub $11, $2, $3: Instruction fetch
lw $10, 20($1): Instruction decode

[Figure: pipelined datapath in clock cycle 2]

Example 1: Cycle 3
sub $11, $2, $3: Instruction decode
lw $10, 20($1): Execution

[Figure: pipelined datapath in clock cycle 3]

Example 1: Cycle 4
sub $11, $2, $3: Execution
lw $10, 20($1): Memory

[Figure: pipelined datapath in clock cycle 4]

Example 1: Cycle 5
sub $11, $2, $3: Memory
lw $10, 20($1): Write Back

[Figure: pipelined datapath in clock cycle 5]

Example 1: Cycle 6
sub $11, $2, $3: Write Back

[Figure: pipelined datapath in clock cycle 6]

Pipeline Control: Control Signals
[Figure: pipelined datapath annotated with its control signals: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, and MemtoReg. Instruction [15-0] feeds the sign extend unit and the ALU control; Instruction [20-16] and Instruction [15-11] feed the RegDst MUX.]

Effect of Seven 1-bit Control Signals

Signal name | Effect when deasserted | Effect when asserted
MemRead | None | Data memory contents at the read address are put on the read data output
MemWrite | None | Data memory contents at the address given by the write address are replaced by the value on the write data input
ALUSrc | The second ALU operand comes from the second register file output | The second ALU operand is the sign-extended lower 16 bits of the instruction
RegDst | The register destination number for the Write register comes from the rt field | The register destination number for the Write register comes from the rd field
RegWrite | None | The register on the Write register input is written with the value on the write data input
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4 | The PC is replaced by the output of the adder that computes the branch target
MemtoReg | The value fed to the register write data input comes from the ALU | The value fed to the register write data input comes from the data memory

The function of each of the seven control signals. When the 1-bit control to a two-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Remember that the state elements all have the clock as an implicit input and that the clock is used in controlling writes.

Group Signals According to Stages
 Original control for the single-clock-cycle implementation

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1

 The same values grouped by the pipeline stage that uses them

             Execution/address calculation stage  Memory access stage        Write back stage
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc       Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0            0       0        0         1         0
lw           0       0       0       1            0       1        0         1         1
sw           X       0       0       1            0       0        1         0         X
beq          X       0       1       0            1       0        0         0         X

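The regrouping is just a re-indexing of the same truth table, which a short Python sketch makes explicit. The dictionary layout and the function name `group_by_stage` are choices made for this example; the values are the ones in the table above, with "X" standing for don't care.

```python
CONTROL = {
    "R-format": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0,
                     MemWrite=0, Branch=0, ALUOp1=1, ALUOp0=0),
    "lw": dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1,
               MemWrite=0, Branch=0, ALUOp1=0, ALUOp0=0),
    "sw": dict(RegDst="X", ALUSrc=1, MemtoReg="X", RegWrite=0, MemRead=0,
               MemWrite=1, Branch=0, ALUOp1=0, ALUOp0=0),
    "beq": dict(RegDst="X", ALUSrc=0, MemtoReg="X", RegWrite=0, MemRead=0,
                MemWrite=0, Branch=1, ALUOp1=0, ALUOp0=1),
}

# Which stage consumes each signal.
STAGE_SIGNALS = {
    "EX": ["RegDst", "ALUOp1", "ALUOp0", "ALUSrc"],
    "MEM": ["Branch", "MemRead", "MemWrite"],
    "WB": ["RegWrite", "MemtoReg"],
}

def group_by_stage(instruction):
    """Split one row of the control table into its EX, MEM, and WB fields."""
    row = CONTROL[instruction]
    return {stage: {name: row[name] for name in names}
            for stage, names in STAGE_SIGNALS.items()}

if __name__ == "__main__":
    for name in CONTROL:
        print(name, group_by_stage(name))
```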
Data Stationary Control
 Pass control signals along just like the data
 Main control generates control signals during ID

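[Figure (Fig. 6.26): the control unit decodes the instruction held in IF/ID into EX, M, and WB fields. ID/EX carries EX, M, and WB; EX/MEM carries M and WB; MEM/WB carries WB.]
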
Data Stationary Control (cont.)
 Signals for EX (ExtOp, ALUSrc, ...) are used 1 cycle later
 Signals for MEM (MemWr, Branch) are used 2 cycles later
 Signals for WB (MemtoReg, RegWr) are used 3 cycles later

[Figure: main control in ID feeds the ID/EX, EX/MEM, and MEM/WB registers. ExtOp, ALUSrc, ALUOp, and RegDst travel as far as EX; MemWr and Branch travel as far as MEM; MemtoReg and RegWr travel as far as WB.]

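A small Python sketch can show how the control fields shrink as they move down the pipeline. It is an illustration only: each pipeline register is a dictionary (or `None` for an empty slot), the field names EX, M, and WB follow the figure, and the `clock` function is invented for this example.

```python
# Control fields for lw, generated once in ID.
LW = {
    "EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
    "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
    "WB": {"RegWrite": 1, "MemtoReg": 1},
}

def clock(pipeline, decoded):
    """Advance one cycle.

    pipeline is (id_ex, ex_mem, mem_wb); decoded is the control produced in ID
    this cycle, or None if no instruction is being decoded.
    """
    id_ex, ex_mem, _ = pipeline
    # The EX field is consumed in EX, so only M and WB move on to EX/MEM.
    new_ex_mem = {"M": id_ex["M"], "WB": id_ex["WB"]} if id_ex else None
    # The M field is consumed in MEM, so only WB moves on to MEM/WB.
    new_mem_wb = {"WB": ex_mem["WB"]} if ex_mem else None
    return (decoded, new_ex_mem, new_mem_wb)

if __name__ == "__main__":
    state = (None, None, None)
    for cycle, decoded in enumerate([LW, None, None], start=1):
        state = clock(state, decoded)
        print(cycle, state)
```

After three clock edges only the WB field of the load is left, which is the "used 3 cycles later" case.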
WB Stage of load
Register($t1) <= MEM/WB(Data)

Who will supply this address? (callout on the register file's Write register input)

[Figure: pipelined datapath with lw in the WB (write back) stage highlighted]

Datapath with Control
[Figure: pipelined datapath with control. The control unit in ID writes the EX, M, and WB fields into ID/EX; the fields are passed on through EX/MEM and MEM/WB and drive RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, RegWrite, and PCSrc.]

Summary of Pipeline Basics
 Pipelining is a fundamental concept
 Multiple steps using distinct resources
 Use the capabilities of the datapath fully through pipelined instruction processing
 Start the next instruction while working on the current one
 Limited by the length of the longest stage (plus fill/flush)
 Need to detect and resolve hazards
 What makes it easy in MIPS?
 All instructions are of the same length
 Just a few instruction formats
 Memory operands only in loads and stores
 What makes pipelining hard? Hazards

Hazard Detection
 A data hazard exists when one of the source register numbers (in the ID/EX pipeline register) equals the destination register number in the EX/MEM or MEM/WB pipeline register
- 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
- 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
‧ Example (hazard 1a on $2)
‧ sub $2, $1, $3
‧ and $12, $2, $5
- 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
- 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
‧ Example (hazard 2b on $2, between sub and or)
‧ sub $2, $1, $3
‧ and $12, $2, $5
‧ or $13, $6, $2
- Condition: EX/MEM.RegisterRd != 0, MEM/WB.RegisterRd != 0
(for the reason, see the next page)

Hazard Detection
 This policy is inaccurate
 Sometimes it would forward when unnecessary
* Some instructions do not write a register
 Check whether the RegWrite signal will be active:
 Examine the WB control field of the pipeline register during the EX and MEM stages
* Register $0 as the destination
 sll $0, $1, 2
 $0 must not be forwarded ($0 cannot be changed)
 Add the (!= 0) condition to correct it

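Putting the four comparisons together with the RegWrite and `!= 0` refinements gives the check sketched below. The dictionary keys and the function name are assumptions made for the example; the labels 1a, 1b, 2a, 2b follow the numbering used in the hazard conditions.

```python
def data_hazards(id_ex, ex_mem, mem_wb):
    """Return the set of hazard labels that apply to the instruction in ID/EX.

    id_ex holds the source register numbers Rs and Rt; ex_mem and mem_wb hold
    the destination number Rd and the RegWrite control bit of older instructions.
    """
    found = set()
    for prefix, older in (("1", ex_mem), ("2", mem_wb)):
        # An instruction that does not write a register, or that targets $0,
        # can never be the source of a hazard.
        if not older["RegWrite"] or older["Rd"] == 0:
            continue
        if older["Rd"] == id_ex["Rs"]:
            found.add(prefix + "a")
        if older["Rd"] == id_ex["Rt"]:
            found.add(prefix + "b")
    return found

if __name__ == "__main__":
    sub = {"Rd": 2, "RegWrite": 1}    # sub $2, $1, $3
    idle = {"Rd": 0, "RegWrite": 0}   # empty pipeline slot
    print(data_hazards({"Rs": 2, "Rt": 5}, sub, idle))  # and $12, $2, $5 right after sub
```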
Data to be Forwarded
Time (in clock cycles)
                       CC1  CC2  CC3  CC4  CC5     CC6  CC7  CC8  CC9
Value of register $2:  10   10   10   10   10/-20  -20  -20  -20  -20
Value of EX/MEM:       X    X    X    -20  X       X    X    X    X
Value of MEM/WB:       X    X    X    X    -20     X    X    X    X

Program execution order (in instructions)
sub $2, $1, $3         IM   Reg  ALU  DM   Reg
and $12, $2, $5             IM   Reg  ALU  DM      Reg
or  $13, $6, $2                  IM   Reg  ALU     DM   Reg
add $14, $2, $2                       IM   Reg     ALU  DM   Reg
sw  $15, 100($2)                           IM      Reg  ALU  DM   Reg

ALU inputs can come from any pipeline register (ID/EX, EX/MEM, or MEM/WB); add a MUX to select among them.

Datapath without Forwarding
[Figure a: No forwarding. The ALU operands come only from the register file values held in ID/EX.]

Datapath with Forwarding
[Figure b: With forwarding. MUXes controlled by ForwardA and ForwardB select each ALU operand from ID/EX, EX/MEM, or MEM/WB. The forwarding unit compares the Rs and Rt numbers in ID/EX with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]

Forwarding Control
 Three sources for the MUX
 ID/EX: no forwarding, just from the register file (00)
 EX/MEM: forwarded data from the prior ALU results (10)
 MEM/WB: from data memory or an earlier ALU result (01)

Forwarding Control
 Forwarding control will be in the EX stage because ALU
forwarding MUX is in this stage
 Pass the operand register number from the ID stage via the
ID/EX pipeline register
 We already have rt field (bits 20-16), add rs to ID/EX pipeline
register
 1. EX hazard (select forwarded data from the EX/MEM stage)
If (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA=10
If (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB=10

Forwarding Control
 MEM hazard (select forwarded data from the MEM/WB stage)
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA=01
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB=01

Potential More Complicated Data Hazard

 Between the results in the WB stage, the MEM stage, and the ALU source
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
(vector summation)
 In the case above, both forwarding conditions hold, but the MEM hazard result is the wrong one because it is older. Select the forwarded data from the EX/MEM stage.
 Modify the control for the MEM hazard so that it does not select forwarded data from the MEM/WB stage in this case
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA=01
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB=01

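The EX-hazard and MEM-hazard rules can be combined into one forwarding unit. In the sketch below the priority of EX/MEM over MEM/WB is expressed by testing the EX hazard first, which is a slightly stricter form of the extra `!=` term in the modified MEM condition; the dictionary keys and function names are assumptions for this illustration.

```python
def forward_select(src, ex_mem, mem_wb):
    """2-bit MUX select for one ALU operand whose source register number is src."""
    # EX hazard: the newest value is in EX/MEM, so it takes priority.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src:
        return 0b10
    # MEM hazard: only reached when no EX hazard exists for this operand.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src:
        return 0b01
    # No hazard: use the value read from the register file.
    return 0b00

def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return (forward_select(id_ex["Rs"], ex_mem, mem_wb),
            forward_select(id_ex["Rt"], ex_mem, mem_wb))

if __name__ == "__main__":
    # Third add of the vector summation: add $1, $1, $4
    second_add = {"Rd": 1, "RegWrite": 1}   # in EX/MEM
    first_add = {"Rd": 1, "RegWrite": 1}    # in MEM/WB
    print(forwarding_unit({"Rs": 1, "Rt": 4}, second_add, first_add))
```

For the third `add`, ForwardA is 10 (take the EX/MEM value) even though MEM/WB also holds a value for `$1`.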
What if a Data Hazard Cannot Be Solved by Forwarding
 lw can still cause a hazard:
 if it is followed by an instruction that reads the loaded register.

lw  $2, 20($1)     IM   Reg  ALU  DM   Reg
and $4, $2, $5          IM   Reg  ALU  DM   Reg
or  $8, $2, $6               IM   Reg  ALU  DM   Reg
add $9, $4, $2                    IM   Reg  ALU  DM   Reg
slt $1, $6, $7                         IM   Reg  ALU  DM   Reg

Resolve it by stalling or by compiler scheduling

Hazard Detection Unit
 Operates during the ID stage
 Insert the stall between the load and its use
IF (ID/EX.MemRead and
( (ID/EX.RegisterRt == IF/ID.RegisterRs) or
(ID/EX.RegisterRt == IF/ID.RegisterRt) ) )
Stall the pipeline
That is:
IF (the instruction in EX is a load and
(the destination register of the load matches either
source register of the instruction in the ID stage) )
Stall the pipeline

Stall the Pipeline
 IF and ID Stage
 Preserve the register values (PC and IF/ID)
 The instruction in the IF stage will continue to be read using the same PC
 The registers in the ID stage will continue to be read using the same instruction fields in the IF/ID pipeline register
 Other stages (EX, MEM, WB)
 Insert “NOP” instruction: do nothing
 That is: deassert all nine control signals (set them to 0)
 No registers or memories are written if the control values are all 0

Stall the Pipeline

 Stall the pipeline by keeping instructions in the same stage and inserting a NOP instead

Time (in clock cycles); program execution order (in instructions) runs top to bottom
                   CC 1  CC 2  CC 3  CC 4      CC 5      CC 6      CC 7  CC 8  CC 9  CC 10
lw  $2, 20($1)     IM    Reg   ALU   DM        Reg
and becomes nop          IM    Reg   (bubble)  (bubble)  (bubble)
and $4, $2, $5                 IM    Reg       ALU       DM        Reg
or  $8, $2, $6                       IM        Reg       ALU       DM    Reg
add $9, $4, $2                                 IM        Reg       ALU   DM    Reg

Handling Stalls
 Hazard detection unit in ID to insert a stall between a load instruction and its use:
if (ID/EX.MemRead and
((ID/EX.RegisterRt == IF/ID.RegisterRs) or
(ID/EX.RegisterRt == IF/ID.RegisterRt)))
stall the pipeline for one cycle
(ID/EX.MemRead=1 indicates a load instruction)
 How to stall?
 Stall the instructions in IF and ID: do not change PC and IF/ID
=> the stages re-execute the instructions
 What to move into EX: insert a NOP by changing the EX, MEM, WB control fields of the ID/EX pipeline register to 0
 as the control signals propagate, all control signals to EX, MEM, WB are deasserted and no registers or memories are written

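The stall condition and its cost in cycles can be sketched in Python. The program encoding (dictionaries with `op`, `dest`, `rs`, `rt`) and the function names are invented for this example, and the model only looks at adjacent instruction pairs, on the assumption that forwarding covers every other dependence.

```python
def must_stall(id_ex, if_id):
    """Load-use test: the load in EX writes a register that the instruction in ID reads."""
    return bool(id_ex["MemRead"]) and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])

def count_stalls(program):
    """Count one-cycle stalls caused by a load immediately followed by a use."""
    stalls = 0
    for producer, consumer in zip(program, program[1:]):
        id_ex = {"MemRead": producer["op"] == "lw", "Rt": producer["dest"]}
        if_id = {"Rs": consumer["rs"], "Rt": consumer["rt"]}
        if must_stall(id_ex, if_id):
            stalls += 1
    return stalls

def total_cycles(program, n_stages=5):
    """Ideal pipeline time plus one bubble per load-use stall."""
    return n_stages + len(program) - 1 + count_stalls(program)

PROGRAM = [
    {"op": "lw", "dest": 2, "rs": 1, "rt": 2},    # lw  $2, 20($1)
    {"op": "and", "dest": 4, "rs": 2, "rt": 5},   # and $4, $2, $5
    {"op": "or", "dest": 8, "rs": 2, "rt": 6},    # or  $8, $2, $6
    {"op": "add", "dest": 9, "rs": 4, "rt": 2},   # add $9, $4, $2
]

if __name__ == "__main__":
    print(count_stalls(PROGRAM), total_cycles(PROGRAM))
```

For the four-instruction example there is one stall, so the sequence needs 9 cycles instead of 8.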
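Branch Hazards
 By the time the branch decision is made, other instructions are already in the pipeline!

Program execution order (in instructions)
40 beq $1, $3, 7      IM   Reg  ALU  DM   Reg
44 and $12, $2, $5         IM   Reg  ALU  DM   Reg
48 or  $13, $6, $2              IM   Reg  ALU  DM   Reg
52 add $14, $2, $2                   IM   Reg  ALU  DM   Reg
72 lw  $4, 50($7)                         IM   Reg  ALU  DM   Reg

Remember: the branch is taken in the MEM stage, so the instructions at 44, 48, and 52 have already been fetched.
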
Solving Control Hazards
 Note: there is no scheme for control hazards that is as effective as forwarding is for data hazards, so use a simple one.
 Assume Branch Not Taken
 Keep fetching instructions down the sequential instruction stream
 When does this strategy pay off?
 If branches are untaken half the time and it costs little to discard the instructions
 If the prediction is wrong, discard the fetched instructions
 Discard instructions
 Flush the instructions in the IF, ID and EX stages

Solving Control Hazards
 Reducing the delay of branches
 Concept
 Move the branch execution earlier in the pipeline, so that fewer instructions need to be flushed.
 What did the MIPS designers do?
 Make the common case fast
 Many branches rely only on simple tests (equality or sign)
 Such tests can be done with a few gates, without the full ALU

Solving Control Hazards
 Move the branch execution from MEM to the ID stage
 2 steps in the ID stage
 1. Compute the branch target (PC + offset)
 Move the branch adder from the EX stage to the ID stage
 2. Evaluate the branch decision
 Equality test of two registers
 XOR their respective bits and OR all the results
 Flush the instruction in the IF stage
 New control line: IF.Flush
 Zero the instruction field of the IF/ID pipeline register (=> NOP)

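The two ID-stage steps can be written out in Python. This is a sketch: the function names are invented here, the target uses the standard MIPS rule of adding the word offset shifted left by 2 to PC + 4, and the bit-by-bit loop only mimics the XOR/OR gate network.

```python
def registers_equal(a, b, width=32):
    """Equality test as described: XOR the bits, OR the results; equal when the OR is 0."""
    diff = (a ^ b) & ((1 << width) - 1)
    any_bit_set = 0
    for i in range(width):
        any_bit_set |= (diff >> i) & 1
    return any_bit_set == 0

def branch_target(pc, offset_words):
    """Target of a branch located at address pc: (PC + 4) + (offset << 2)."""
    return pc + 4 + (offset_words << 2)

def resolve_beq_in_id(pc, rs_value, rt_value, offset_words):
    """Return (address of the next instruction to execute, IF.Flush) for a beq at pc."""
    if registers_equal(rs_value, rt_value):
        # Taken: the sequential instruction already fetched in IF must be flushed.
        return branch_target(pc, offset_words), True
    return pc + 4, False

if __name__ == "__main__":
    print(resolve_beq_in_id(40, 5, 5, 7))   # the beq $1, $3, 7 at address 40, taken
    print(resolve_beq_in_id(40, 5, 6, 7))   # the same branch, not taken
```

For the `beq` at address 40 with offset 7 the target is 72, the address of the `lw` in the branch hazard example.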
Dynamic Branch Prediction
 Performance = ƒ(accuracy, cost of misprediction)
 Branch prediction buffer or branch history table
 A small memory indexed by the lower portion of the address of
branch instruction. The memory contains a bit that says whether
the branch was recently taken or not.
 No address check
 Simplest one: 1-bit prediction

Basic Branch Prediction Buffers
Branch History Table (BHT) - Small direct-mapped cache of T/NT bits
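[Figure: the PC indexes the BHT while the branch instruction is fetched into IR. On T (predict taken) the next PC is the branch target produced by the adder; on NT (predict not taken) the next PC is PC + 4.]
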
Branch Prediction Buffer
 A branch prediction buffer can be implemented as a small
cache accessed during the IF stage.

Dynamic Branch Prediction
 2-bit Prediction Scheme
 A prediction must be wrong twice before it is changed
 Suits branches that strongly favor taken or not taken
 Such a branch is mispredicted only once

 Advanced ones
 Correlating predictors
 Global and local branch behavior
 Tournament predictor
 Multiple predictions for each branch

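A 2-bit scheme is often modelled as a saturating counter per table entry, and that variant is sketched here. The class, the table size of 16 entries, and the indexing by the low bits of the word address are all assumptions for this illustration, not details taken from the slides; like the branch history table, the sketch does no address check.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters (0, 1: not taken; 2, 3: taken)."""

    def __init__(self, entries=16):
        self.counters = [0] * entries

    def _index(self, pc):
        # Low bits of the word address select the entry; there is no tag check.
        return (pc >> 2) % len(self.counters)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def mispredictions(predictor, pc, outcomes):
    """Run a sequence of actual outcomes through the predictor and count wrong guesses."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]   # taken nine times, then falls out of the loop
    bht = TwoBitPredictor()
    print(mispredictions(bht, 40, loop_branch))   # warm-up pass
    print(mispredictions(bht, 40, loop_branch))   # steady state
```

Once warmed up, the loop branch is mispredicted only once per pass, at the loop exit, because a single not-taken outcome does not flip the prediction.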
Scheduling the Branch Delay Slot

a. From before
add $s1, $s2, $s3
beq $1, $3, 7
Delay slot
Becomes
beq $1, $3, 7
add $s1, $s2, $s3

b. From target
sub $t4, $t5, $t6
‧‧‧
add $s1, $s2, $s3
beq $1, $3, 7
Delay slot
Becomes
add $s1, $s2, $s3
beq $1, $3, 7
sub $t4, $t5, $t6

c. From fall through
add $s1, $s2, $s3
beq $1, $3, 7
Delay slot
sub $t4, $t5, $t6
Becomes
add $s1, $s2, $s3
beq $1, $3, 7
sub $t4, $t5, $t6

 A is the best. Use B or C when A is impossible (data dependency).
 B is preferred when the branch is taken with high probability, such as in a loop
 C is scheduled from the not-taken fall through.
 For B and C, it must be O.K. to execute the delay slot instruction even when the branch goes the other way.

How it Works?
[Figure: the final pipelined datapath with the hazard detection unit, the forwarding unit, the IF.Flush, ID.Flush, and EX.Flush control lines, and the Cause and EPC registers]

Instruction Level Parallelism
 Reference for more details
 Computer Architecture: A Quantitative Approach
 Two methods to increase ILP
 Increase the pipeline depth
 More operations being overlapped
 Pipeline speedup ∝ pipeline depth
 8 or more pipeline stages
 To get the speedup, rebalance the remaining steps
 Performance is potentially greater because the clock cycle is shorter
 Multiple issue
 Issue 3 to 8 instructions in every clock cycle
 Static multiple issue: determined at compile time
 Dynamic multiple issue: determined during execution
 Two problems
 How to package instructions into issue slots (by compiler or hardware)
 Dealing with data and control hazards (by compiler or hardware)

Speculation: Find and Exploit more ILP
 Speculation
 An approach that allows the compiler or the processor to “guess” the outcome of an instruction to remove it as a dependence in executing other instructions
 E.g. branch, store before load
 How it works
 Compiler or processor uses speculation to
 reorder instructions,
 move an instruction across a branch or
 a load across a store
 Mechanism
 A method to check whether the guess was right and a method to back out the effects
 Difficulty: what if the guess was wrong (back-out capability)

Speculation: Find and Exploit more ILP

 Recovery mechanism for incorrect speculation


 Software approach
 Compiler inserts additional instructions to
 Check the accuracy of the speculation
 Provide a fix-up routine

 Hardware approach
 Buffer the speculative results until no longer speculative
 If correct, complete the instruction (write results to registers)
 If incorrect, flush the buffer and re-execute the correct one

Speculation: Find and Exploit more ILP

 Other possible problem: exception in a speculative instruction
 Speculating on certain instructions may introduce exceptions that were formerly not present
 E.g. if a load is executed speculatively and its address is illegal, an exception occurs that should not have happened. (The exception should occur only when the load is not speculative.)
 Compiler-based speculation
 Allow such exceptions to be ignored until it is clear that they should occur
 Hardware-based speculation
 Buffer such exceptions until the instruction is no longer speculative, then raise the exception

Static Multiple Issue
 The compiler assists with packaging instructions and handling data hazards
 Issue packet
 As one large instruction with multiple operations
 VLIW: very long instruction word
 EPIC: Explicitly Parallel Instruction Computer (IA-64)
 Variation: how the compiler handles hazards
 1. The compiler handles all hazards, schedules code, and inserts code
 2. The compiler handles all dependences within an instruction, and hardware detects data hazards and generates stalls between two issue packets

Two-Issue MIPS Processor
 Static two-issue pipeline (64-bit IF and ID)

ALU or branch   IF  ID  EX  MEM  WB
Load or store   IF  ID  EX  MEM  WB
ALU or branch       IF  ID  EX   MEM  WB
Load or store       IF  ID  EX   MEM  WB

 Extra hardware
 Register file
 2 reads for ALU, 2 reads for store, one write for ALU, one write for load
 Separate adder for address calculation of data transfers
 Performance
 Improves by up to a factor of 2 (upper bound)
 In reality, it depends on how the instructions are scheduled. The compiler takes on this role.

Multiple Issue Code Scheduling
Original: add scalar $s2 to array
Loop:
lw $t0, 0($s1) #$t0 = array element
addu $t0, $t0, $s2 #add scalar in $s2
sw $t0, 0($s1) #store result
addi $s1, $s1, -4 #decrement pointer
bne $s1, $zero, Loop #branch if $s1 != 0

 Note: the result of a load cannot be used on the next clock cycle because of the load-use dependency.

Scheduled code for two-issue MIPS (4 cycles)
        ALU or branch inst.     Data transfer inst.   Clock cycle
Loop:   (nop)                   lw $t0, 0($s1)        1
        addi $s1, $s1, -4       (nop)                 2
        addu $t0, $t0, $s2      (nop)                 3
        bne $s1, $zero, Loop    sw $t0, 4($s1)        4

Loop Unrolling for 2-Issue MIPS
 To get more performance from loops: loop unrolling
 Assume the loop index is a multiple of four
 Unroll the loop four times; use register renaming to remove antidependences

        ALU or branch inst.     Data transfer inst.
Loop:   addi $s1, $s1, -16      lw $t0, 0($s1)
        (nop)                   lw $t1, 12($s1)
        addu $t0, $t0, $s2      lw $t2, 8($s1)
        addu $t1, $t1, $s2      lw $t3, 4($s1)
        addu $t2, $t2, $s2      sw $t0, 16($s1)
        addu $t3, $t3, $s2      sw $t1, 12($s1)
        (nop)                   sw $t2, 8($s1)
        bne $s1, $zero, Loop    sw $t3, 4($s1)

Original loop:
for (i = 0; i < 16; i++){
array[i] = array[i] + scalar;
}

Unrolled loop:
for (i = 0; i < 16; i += 4){
array[i] = array[i] + scalar;
array[i+1] = array[i+1] + scalar;
array[i+2] = array[i+2] + scalar;
array[i+3] = array[i+3] + scalar;
}

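The payoff of unrolling can be put in numbers, and the equivalence of the two loops checked, with a short Python sketch. The instruction counts are taken from the two schedules (5 instructions in 4 cycles for the original loop, 14 instructions in 8 cycles for the unrolled one); the function names are made up for the illustration.

```python
def add_scalar(array, scalar):
    """Original loop: one element per iteration."""
    for i in range(len(array)):
        array[i] = array[i] + scalar
    return array

def add_scalar_unrolled(array, scalar):
    """Unrolled four times; t0..t3 play the role of the renamed registers."""
    if len(array) % 4 != 0:
        raise ValueError("length must be a multiple of four")
    for i in range(0, len(array), 4):
        t0, t1, t2, t3 = array[i], array[i + 1], array[i + 2], array[i + 3]
        array[i], array[i + 1] = t0 + scalar, t1 + scalar
        array[i + 2], array[i + 3] = t2 + scalar, t3 + scalar
    return array

def ipc(instructions, cycles):
    """Instructions completed per clock cycle."""
    return instructions / cycles

if __name__ == "__main__":
    data = list(range(16))
    assert add_scalar(list(data), 3) == add_scalar_unrolled(list(data), 3)
    print("scheduled loop IPC:", ipc(5, 4))    # 4 cycles per array element
    print("unrolled loop IPC:", ipc(14, 8))    # 8 cycles per 4 elements = 2 per element
```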
Dynamic Multiple Issue Processors
 Superscalar
 Instructions issue in order
 0, 1 or more instructions can issue in a given clock cycle
 To achieve good performance
 Needs the compiler to schedule instructions
 More important: hardware guarantees that instructions are executed correctly whether scheduled or not
 Extension: dynamic pipeline scheduling
 Hardware support to reorder the execution order to avoid stalls

Dynamic Pipeline Scheduling
[Figure: the instruction fetch and decode unit issues in order to the reservation stations; the functional units (integer, integer, ..., floating point, load/store) execute out of order; the commit unit commits in order]

Summary
 Performance is specific to a particular program or set of programs
 Total execution time is a consistent summary of performance
 For a given architecture, performance increases come from:
 increases in clock rate (without adverse CPI effects)
 improvements in processor organization that lower CPI
 compiler enhancements that lower CPI and/or instruction count
 Algorithm/language choices that affect instruction count
 Amdahl’s law

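Amdahl's law is only named in the summary; as a reminder, it bounds the overall speedup by the fraction of execution time that an improvement actually affects. The function below is a minimal sketch of the usual formula, with parameter names chosen for this example.

```python
def amdahl_speedup(fraction_improved, improvement_factor):
    """Overall speedup when fraction_improved of the time runs improvement_factor times faster."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / improvement_factor)

if __name__ == "__main__":
    # Doubling the speed of half of the execution time gives well under 2x overall.
    print(amdahl_speedup(0.5, 2))
```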
See You Next Class!
