CH 7

The document discusses the concept of pipelining in computer organization, detailing the stages of instruction processing including instruction fetch, decode, execute, memory access, and write back. It highlights the importance of pipeline registers and addresses potential hazards such as structural hazards when multiple instructions attempt to access the same resources. Additionally, it provides examples of how different instruction types (load, store, R-type, and branch) are processed in a pipelined architecture.

Uploaded by

秦槐駿

Computer Organization

Pipeline

Prof. Ya-Shu Chen

National Taiwan University of Science and Technology
1
Split Single-cycle Datapath
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, write-back MUX, and the feedback path to the PC) divided into the five stages above]
What to add to split the datapath into stages?

2
Add Pipeline Registers
[Figure: the datapath with pipeline registers (latches) IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]

 Use registers between stages to carry data and control

3
Consider load
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Load Ifetch Reg/Dec Exec Mem Wr

 IF: Instruction Fetch

 Fetch the instruction from the Instruction Memory
 ID: Instruction Decode
 Registers fetch and instruction decode
 EX: Calculate the memory address
 MEM: Read the data from the Data Memory
 WB: Write the data back to the register file

4
Pipelining load
Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw    Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw             Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                      Ifetch   Reg/Dec  Exec     Mem      Wr

 5 functional units in the pipeline datapath are:

 Instruction Memory for the Ifetch stage
 Register File’s Read ports (busA and busB) for the Reg/Dec
stage
 ALU for the Exec stage
 Data Memory for the MEM stage
 Register File’s Write port (busW) for the WB stage
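The overlap of the three loads can be checked with a little arithmetic. The following is a minimal Python sketch, assuming an ideal five-stage pipeline with no hazards; the function names are illustrative and do not come from the slides.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed if each instruction finishes every stage before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles in an ideal pipeline: fill it once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

if __name__ == "__main__":
    for n in (1, 3, 100):
        print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

For the three loads above this gives 7 cycles instead of 15; for long instruction streams the ratio approaches the number of stages.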
5
IF Stage of load
CODE: lw $t1, 100($t2)
IF/ID <= MEM[PC] (the instruction, IR)
IF/ID <= PC + 4
PC = PC + 4
[Figure: pipelined datapath with lw in the IF stage (instruction fetch); IR and PC + 4 are written into IF/ID]

6
ID Stage of load
CODE: lw $t1, 100($t2)
ID/EXE(PC + 4) <= IF/ID(PC + 4)
ID/EXE($t1) <= $t1
ID/EXE($t2) <= $t2
ID/EXE(signext) <= signext(100)
[Figure: pipelined datapath with lw in the ID stage (instruction decode)]

7
EX Stage of load
CODE: lw $t1, 100($t2)
EXE/MEM(address) <= ID/EXE($t2) + ID/EXE(signext(100))
[Figure: pipelined datapath with lw in the EX stage (execution)]

8
MEM Stage of load
CODE: lw $t1, 100($t2)
MEM/WB(Data) <= Data[ EXE/MEM(address) ]
[Figure: pipelined datapath with lw in the MEM stage (memory access)]

9
WB Stage of load
CODE: lw $t1, 100($t2)
Register($t1) <= MEM/WB(Data)
Who will supply this address?
[Figure: pipelined datapath with lw in the WB stage (write back); the question points at the register file's write register input]
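The register transfers for lw on the preceding slides can be traced in a few lines of Python. This is a sketch under stated assumptions: the register file and data memory are plain dictionaries, the function name `run_lw` is made up for illustration, and the destination register number is carried along in every pipeline register (the usual textbook answer to "Who will supply this address?", treated here as an assumption).

```python
def run_lw(regs, data_mem, rt, rs, offset, pc):
    """Walk one `lw rt, offset(rs)` through IF, ID, EX, MEM, WB.

    regs and data_mem are dicts; returns the pipeline-register contents.
    """
    # IF: fetch the instruction and compute PC + 4
    if_id = {"instr": ("lw", rt, rs, offset), "pc4": pc + 4}
    # ID: read both registers; the Python int stands in for the
    # sign-extended 16-bit immediate
    _, rt_num, rs_num, imm = if_id["instr"]
    id_ex = {"pc4": if_id["pc4"], "rs_val": regs[rs_num],
             "rt_val": regs[rt_num], "signext": imm, "rt": rt_num}
    # EX: the ALU adds the base register and the offset
    ex_mem = {"address": id_ex["rs_val"] + id_ex["signext"], "rt": id_ex["rt"]}
    # MEM: read data memory at the computed address
    mem_wb = {"data": data_mem[ex_mem["address"]], "rt": ex_mem["rt"]}
    # WB: the write register number has travelled with the instruction
    regs[mem_wb["rt"]] = mem_wb["data"]
    return {"IF/ID": if_id, "ID/EX": id_ex, "EX/MEM": ex_mem, "MEM/WB": mem_wb}

if __name__ == "__main__":
    regs = {"$t1": 0, "$t2": 1000}
    memory = {1100: 42}
    run_lw(regs, memory, "$t1", "$t2", 100, pc=0)
    print(regs["$t1"])
```

Without the `rt` field in ID/EX, EX/MEM, and MEM/WB, the write in WB would have to take its register number from IF/ID, which by then holds a later instruction.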

10
The Four Stages of R-type

Cycle 1 Cycle 2 Cycle 3 Cycle 4

R-type Ifetch Reg/Dec Exec Wr

 IF: fetch the instruction from the Instruction Memory

 ID: registers fetch and instruction decode
 EX: ALU operates on the two register operands
 WB: write ALU output back to the register file

11
Pipelining R-type and load
Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type    Ifetch   Reg/Dec  Exec     Wr
R-type             Ifetch   Reg/Dec  Exec     Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Wr

Oops! We have a problem!

 We have a structural hazard:

 Two instructions try to write to the register file at the same time!
 Only one write port
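The conflict can be found mechanically by recording the cycle in which each instruction reaches its Wr stage. The sketch below is an illustration only; it assumes one instruction is issued per cycle starting at cycle 1, and the names `STAGE_SEQ` and `write_port_conflicts` are hypothetical.

```python
from collections import defaultdict

# Stage sequences before the fix: R-type has four stages, load has five.
STAGE_SEQ = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "Load":   ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}

def write_port_conflicts(program):
    """Return {cycle: [instruction indices]} for cycles in which more than
    one instruction is in its Wr stage and needs the single write port."""
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        wr_offset = STAGE_SEQ[kind].index("Wr")
        writers[i + 1 + wr_offset].append(i)   # instruction i starts in cycle i + 1
    return {cycle: idx for cycle, idx in writers.items() if len(idx) > 1}

if __name__ == "__main__":
    print(write_port_conflicts(["R-type", "R-type", "Load", "R-type"]))
```

For the sequence in the diagram, the load (index 2) and the last R-type (index 3) both write in cycle 7.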

12
Important Observation

 Each functional unit can only be used once per instruction

 Each functional unit must be used at the same stage for all
instructions:

1 2 3 4 5
Load Ifetch Reg/Dec Exec Mem Wr

1 2 3 4
R-type Ifetch Reg/Dec Exec Wr

13
Solution: Delay R-type’s Write
 Delay R-type’s register write by one cycle:
 R-type also uses the Reg File’s write port at Stage 5
 MEM is a NOP stage: nothing is being done.
          1        2        3        4        5
R-type    Ifetch   Reg/Dec  Exec     Mem      Wr

Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type    Ifetch   Reg/Dec  Exec     Mem      Wr
Load               Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                      Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Mem      Wr

R-type also has 5 stages

14
The Four Stages of store
Cycle 1 Cycle 2 Cycle 3 Cycle 4

Store Ifetch Reg/Dec Exec Mem Wr

 IF: fetch the instruction from the Instruction Memory

 ID: registers fetch and instruction decode
 EX: calculate the memory address
 MEM: write the data into the Data Memory

Add an extra stage:

 WB: NOP

15
The Three Stages of beq
Cycle 1 Cycle 2 Cycle 3 Cycle 4

Beq Ifetch Reg/Dec Exec Mem Wr

 IF: fetch the instruction from the Instruction Memory

 ID: registers fetch and instruction decode
 EX:
 compares the two register operands
 selects the correct branch target address
 latches it into the PC
Add two extra stages:
 MEM: NOP

 WB: NOP
16
Pipelined Datapath
[Figure: the complete pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]

17
Graphically Representing Pipelines
Time (in clock cycles); program execution order (in instructions)
                   CC 1  CC 2  CC 3  CC 4  CC 5  CC 6
lw  $10, 20($1)    IM    Reg   ALU   DM    Reg
sub $11, $2, $3          IM    Reg   ALU   DM    Reg

 Can help with answering questions like:

 How many cycles to execute this code?
 What is the ALU doing during cycle 4?
 Help understand datapaths
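Both questions can be answered from the diagram with index arithmetic. A small Python sketch follows, assuming one instruction enters the pipeline per cycle and no stalls occur; `unit_user` and `total_cycles` are illustrative names.

```python
UNITS = ["IM", "Reg", "ALU", "DM", "Reg"]   # the unit used in each stage, in order

def unit_user(program, stage_index, cycle):
    """Return the instruction occupying stage `stage_index` (0-based) in
    clock cycle `cycle` (1-based), or None if the stage is idle."""
    i = cycle - 1 - stage_index   # instruction i enters stage s in cycle i + s + 1
    return program[i] if 0 <= i < len(program) else None

def total_cycles(program):
    """Cycles until the last instruction leaves the last stage."""
    return len(program) + len(UNITS) - 1 if program else 0

if __name__ == "__main__":
    program = ["lw $10, 20($1)", "sub $11, $2, $3"]
    print(total_cycles(program))
    print(unit_user(program, UNITS.index("ALU"), 4))
```

For the two-instruction example the code takes 6 cycles, and in cycle 4 the ALU is working on the sub.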
18
Example 1: Cycle 1
lw $10, 20($1): Instruction fetch
[Figure: pipelined datapath with lw in the IF stage]

19
Example 1: Cycle 2
sub $11, $2, $3: Instruction fetch
lw $10, 20($1): Instruction decode
[Figure: pipelined datapath with sub in the IF stage and lw in the ID stage]

20
Example 1: Cycle 3
sub $11, $2, $3: Instruction decode
lw $10, 20($1): Execution
[Figure: pipelined datapath with sub in the ID stage and lw in the EX stage]

21
Example 1: Cycle 4
sub $11, $2, $3: Execution
lw $10, 20($1): Memory
[Figure: pipelined datapath with sub in the EX stage and lw in the MEM stage]

22
Example 1: Cycle 5
sub $11, $2, $3: Memory
lw $10, 20($1): Write Back
[Figure: pipelined datapath with sub in the MEM stage and lw in the WB stage]

23
Example 1: Cycle 6
sub $11, $2, $3: Write Back
[Figure: pipelined datapath with sub in the WB stage]

24
Pipeline Control: Control Signals
[Figure: pipelined datapath with the control signals labeled: PCSrc, Branch, RegWrite, ALUSrc, ALUOp (into ALU control, together with the 6-bit function field from Instruction[15-0]), RegDst (selecting between Instruction[20-16] and Instruction[15-11]), MemWrite, MemRead, and MemtoReg]

25
Effect of Seven 1-bit Control Signals
Signal name | Effect when deasserted | Effect when asserted
MemRead | None | Data memory contents at the read address are put on the read data output
MemWrite | None | Data memory contents at the address given by the write address are replaced by the value on the write data input
ALUSrc | The second ALU operand comes from the second register file output | The second ALU operand is the sign-extended lower 16 bits of the instruction
RegDst | The register destination number for the Write register comes from the rt field | The register destination number for the Write register comes from the rd field
RegWrite | None | The register on the Write register input is written with the value on the write data input
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4 | The PC is replaced by the output of the adder that computes the branch target
MemtoReg | The value fed to the register write data input comes from the ALU | The value fed to the register write data input comes from the data memory

The function of each of the seven control signals. When the 1-bit control to a two-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Remember that the state elements all have the clock as an implicit input and that the clock is used in controlling writes.
26
Group Signals According to Stages
 Original control for single clock cycle implementation

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1

             Execution/address calculation stage   Memory access stage         Write back stage
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc        Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format     1       1       0       0             0       0        0          1         0
lw           0       0       0       1             0       1        0          1         1
sw           X       0       0       1             0       0        1          0         X
beq          X       0       1       0             1       0        0          0         X
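Regrouping the control lines by stage is just a reordering of the same table, which a short Python sketch can make explicit. The dictionary layout and the name `group_by_stage` are assumptions made for this illustration; the signal values are the ones in the single-cycle table, with "X" for don't care.

```python
# Control values per instruction class, in single-cycle table order.
CONTROL = {
    "R-format": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0,
                     MemWrite=0, Branch=0, ALUOp1=1, ALUOp0=0),
    "lw":       dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1,
                     MemWrite=0, Branch=0, ALUOp1=0, ALUOp0=0),
    "sw":       dict(RegDst="X", ALUSrc=1, MemtoReg="X", RegWrite=0, MemRead=0,
                     MemWrite=1, Branch=0, ALUOp1=0, ALUOp0=0),
    "beq":      dict(RegDst="X", ALUSrc=0, MemtoReg="X", RegWrite=0, MemRead=0,
                     MemWrite=0, Branch=1, ALUOp1=0, ALUOp0=1),
}

# Which pipeline stage consumes each signal.
STAGE_GROUPS = {
    "EX":  ["RegDst", "ALUOp1", "ALUOp0", "ALUSrc"],
    "MEM": ["Branch", "MemRead", "MemWrite"],
    "WB":  ["RegWrite", "MemtoReg"],
}

def group_by_stage(instr):
    """Split one instruction's control word into its EX, MEM, and WB fields."""
    signals = CONTROL[instr]
    return {stage: {name: signals[name] for name in names}
            for stage, names in STAGE_GROUPS.items()}

if __name__ == "__main__":
    print(group_by_stage("lw"))
```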

27
Data Stationary Control
 Pass control signals along just like the data
 Main control generates control signals during ID

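[Figure (Fig. 6.26): the control unit decodes the instruction in ID; the ID/EX register stores the EX, M, and WB control fields, the EX/MEM register stores M and WB, and the MEM/WB register stores WB]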
28
Data Stationary Control (cont.)
 Signals for EX (ExtOp, ALUSrc, ...) are used 1 cycle later
 Signals for MEM (MemWr, Branch) are used 2 cycles later
 Signals for WB (MemtoReg, RegWr) are used 3 cycles later

Signal     ID (Main Control)  EX     MEM    WB
ExtOp      generated          used
ALUSrc     generated          used
ALUOp      generated          used
RegDst     generated          used
MemWr      generated          held   used
Branch     generated          held   used
MemtoReg   generated          held   held   used
RegWr      generated          held   held   used
[Figure: the signals travel through the ID/Ex, Ex/MEM, and MEM/WB registers alongside the data]
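Passing control along with the data can be pictured as a shift through the pipeline registers, where each register drops the fields that have already been used. The following Python sketch is only an illustration: `clock`, `BUBBLE`, `EMPTY`, and the `LW` bundle are hypothetical names, and the signal values for lw are taken from the control table of the single-cycle design.

```python
BUBBLE = {"EX": {}, "M": {}, "WB": {}}   # a bundle with no signals asserted

def clock(pipe, decoded):
    """Advance the control fields by one clock edge.

    pipe holds the control fields stored in ID/EX, EX/MEM, and MEM/WB;
    decoded is the {"EX", "M", "WB"} bundle produced by main control in ID.
    """
    return {
        "ID/EX":  {"EX": decoded["EX"], "M": decoded["M"], "WB": decoded["WB"]},
        "EX/MEM": {"M": pipe["ID/EX"]["M"], "WB": pipe["ID/EX"]["WB"]},
        "MEM/WB": {"WB": pipe["EX/MEM"]["WB"]},
    }

EMPTY = {"ID/EX": BUBBLE, "EX/MEM": {"M": {}, "WB": {}}, "MEM/WB": {"WB": {}}}

LW = {"EX": {"ALUSrc": 1, "RegDst": 0},
      "M":  {"MemWr": 0, "Branch": 0},
      "WB": {"MemtoReg": 1, "RegWr": 1}}

if __name__ == "__main__":
    pipe = clock(EMPTY, LW)       # lw moves from ID to EX
    pipe = clock(pipe, BUBBLE)    # lw moves to MEM
    pipe = clock(pipe, BUBBLE)    # lw moves to WB
    print(pipe["MEM/WB"]["WB"])
```

After three clock edges the lw's WB field (MemtoReg, RegWr) sits in MEM/WB, three cycles after it was generated.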

29
WB Stage of load
Register($t1) <= MEM/WB(Data)
Who will supply this address?
[Figure: pipelined datapath with lw in the WB stage (write back); the question points at the register file's write register input]

30
Datapath with Control
[Figure: pipelined datapath with control. The Control unit in ID writes the EX, M, and WB fields into ID/EX; M and WB continue into EX/MEM, and WB into MEM/WB. The fields drive ALUSrc, ALUOp (via ALU control, with the 6-bit function field from Instruction[15:0]), RegDst (selecting between Instruction[20:16] and Instruction[15:11]), Branch, MemRead, MemWrite, PCSrc, MemtoReg, and RegWrite]

31
Summary of Pipeline Basics
 Pipelining is a fundamental concept
 Multiple steps using distinct resources
 Utilize capabilities of datapath by pipelined instruction
processing
 Start next instruction while working on the current one
 Limited by length of longest stage (plus fill/flush)
 Need to detect and resolve hazards
 What makes it easy in MIPS?
 All instructions are of the same length
 Just a few instruction formats
 Memory operands only in loads and stores
 What makes pipelining hard? hazards
32
Hazard Detection
 One of the source register numbers (in the pipeline register
ID/EX) is equal to the destination register number in the EX/MEM or
MEM/WB pipeline register
－ 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
－ 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
‧ Example (hazard 1a: EX/MEM.RegisterRd = ID/EX.RegisterRs = $2)
‧ sub $2, $1, $3
‧ and $12, $2, $5
－ 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
－ 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
‧ Example (hazard 2b between sub and or)
‧ sub $2, $1, $3
‧ and $12, $2, $5
‧ or $13, $6, $2
－ Condition: EX/MEM.RegisterRd != 0, MEM/WB.RegisterRd != 0
(for the reason, see the next page)
33
Hazard Detection
 This policy is inaccurate
 Sometimes it would forward when unnecessary
* Some instructions do not write a register
 Check whether the RegWrite signal will be active
 Examine the WB control field of the pipeline register during the
EX and MEM stages

* Register $0 as the destination

 sll $0, $1, 2
 The result for $0 must not be forwarded ($0 cannot be changed)
 Add the (!= 0) condition to correct it
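The four hazard conditions, together with the RegWrite and $0 checks, fit in a few lines of Python. This is a sketch, not the hardware: pipeline registers are modelled as dictionaries with hypothetical keys `rs`, `rt`, `rd`, and `reg_write`, and register numbers are plain integers.

```python
def data_hazards(id_ex, ex_mem, mem_wb):
    """Return the labels (1a, 1b, 2a, 2b) of the data hazards that are present."""
    found = []
    # EX hazards: the producer is one instruction ahead (result in EX/MEM)
    if ex_mem["reg_write"] and ex_mem["rd"] != 0:
        if ex_mem["rd"] == id_ex["rs"]:
            found.append("1a")
        if ex_mem["rd"] == id_ex["rt"]:
            found.append("1b")
    # MEM hazards: the producer is two instructions ahead (result in MEM/WB)
    if mem_wb["reg_write"] and mem_wb["rd"] != 0:
        if mem_wb["rd"] == id_ex["rs"]:
            found.append("2a")
        if mem_wb["rd"] == id_ex["rt"]:
            found.append("2b")
    return found

if __name__ == "__main__":
    idle = {"rd": 0, "reg_write": False}
    sub = {"rd": 2, "reg_write": True}          # sub $2, $1, $3
    print(data_hazards({"rs": 2, "rt": 5}, sub, idle))   # and $12, $2, $5 in EX
```

With sub in MEM/WB and `or $13, $6, $2` in ID/EX the same function reports 2b, and a producer whose destination is $0 reports nothing.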

34
Data to be Forwarded
Time (in clock cycles)   CC1     CC2     CC3     CC4     CC5     CC6     CC7     CC8     CC9
Value of register $2:    10      10      10      10      10/-20  -20     -20     -20     -20
Value of EX/MEM:         X       X       X       -20     X       X       X       X       X
Value of MEM/WB:         X       X       X       X       -20     X       X       X       X

Program execution order (in instructions)
sub $2, $1, $3           IM      Reg     ALU     DM      Reg
and $12, $2, $5                  IM      Reg     ALU     DM      Reg
or  $13, $6, $2                          IM      Reg     ALU     DM      Reg
add $14, $2, $2                                  IM      Reg     ALU     DM      Reg
sw  $15, 100($2)                                         IM      Reg     ALU     DM      Reg

ALU inputs can come from any of the pipeline registers ID/EXE, EX/MEM, and MEM/WB (add a MUX to select)

35
Datapath without Forwarding
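[Figure a. No forwarding: the ALU takes both operands from the register values held in ID/EX; the ALU result goes through EX/MEM to data memory, and a MUX after MEM/WB selects the write-back data]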

36
Datapath with Forwarding
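[Figure b. With forwarding: a MUX in front of each ALU input, controlled by ForwardA and ForwardB, selects among the ID/EX register value, the EX/MEM result, and the MEM/WB result. The forwarding unit compares Rs and Rt from ID/EX with EX/MEM.RegisterRd and MEM/WB.RegisterRd]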
37
Forwarding Control
 Three sources for the MUX
 ID/EX: no forwarding, just from the register file (00)
 EX/MEM: forwarded data from the prior ALU results (10)
 MEM/WB: from data memory or an earlier ALU result (01)

38
Forwarding Control
 Forwarding control will be in the EX stage because the ALU
forwarding MUX is in this stage
 Pass the operand register numbers from the ID stage via the
ID/EX pipeline register
 We already have the rt field (bits 20-16); add rs to the ID/EX pipeline
register
 1. EX hazard: select forwarded data from the EX/MEM stage
If (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA=10
If (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB=10
39
Forwarding Control
 2. MEM hazard: select forwarded data from the MEM/WB stage
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA=01
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB=01

40
Potential More Complicated Data Hazard

 Between the results in the WB stage, the MEM stage, and the ALU source
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
(vector summation)
 In the case above, both forwarding conditions hold, but the MEM hazard would select the older value, which is incorrect. Select the forwarded data from the EX/MEM stage, which holds the most recent result.
 Modify the control for the MEM hazard so that it does not select forwarded data from the MEM/WB stage when the EX/MEM stage holds the same register
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA=01
If (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB=01
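The forwarding conditions can be sketched as a small Python function that returns the two MUX selects. Assumptions to note: pipeline registers are dictionaries with hypothetical keys `rs`, `rt`, `rd`, and `reg_write`, and the EX hazard is given priority only when EX/MEM really writes the register, which is a slightly stricter reading than the inequality test written in the conditions.

```python
def forward_select(src, ex_mem, mem_wb):
    """Return the 2-bit MUX select for one ALU operand read from register `src`."""
    ex_hit = ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src
    mem_hit = mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src
    if ex_hit:
        return "10"   # most recent result, from EX/MEM
    if mem_hit:
        return "01"   # older result, from MEM/WB
    return "00"       # no forwarding, value from the register file

def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return (forward_select(id_ex["rs"], ex_mem, mem_wb),
            forward_select(id_ex["rt"], ex_mem, mem_wb))

if __name__ == "__main__":
    # Third instruction of the vector summation: add $1, $1, $4
    newer = {"rd": 1, "reg_write": True}   # add $1, $1, $3 in EX/MEM
    older = {"rd": 1, "reg_write": True}   # add $1, $1, $2 in MEM/WB
    print(forwarding_unit({"rs": 1, "rt": 4}, newer, older))
```

For the vector summation, ForwardA is 10 even though the MEM/WB condition also matches, so the ALU receives the newest value of $1.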
41
What if Data Hazard cannot be Solved by Forwarding
 lw can still cause a hazard:
 if it is followed by an instruction that reads the loaded register.
lw $2, 20($1) IM Reg DM Reg

and $4, $2, $5 IM Reg DM Reg

or $8, $2, $6 IM Reg DM Reg

add $9, $4, $2 IM Reg DM Reg

slt $1, $6, $7 IM Reg DM Reg

Use stalling or compiler scheduling to resolve it

42
Hazard Detection Unit
 Operates during the ID stage
 Insert the stall between the load and its use
IF (ID/EX.MemRead and
( (ID/EX.RegisterRt == IF/ID.RegisterRs) or
(ID/EX.RegisterRt == IF/ID.RegisterRt) ) )
Stall the pipeline
That is:
IF (the instruction in the EX stage is a load and
(the destination register of the load matches either
source register of the instruction in the ID stage) )
Stall the pipeline

43
Stall the Pipeline
 IF and ID stages
 Preserve the PC and the IF/ID pipeline register
 The instruction in the IF stage will continue to be read using the
same PC
 The registers in the ID stage will continue to be read using the same
instruction fields in the IF/ID pipeline register
 Other stages (EX, MEM, WB)
 Insert a “NOP” instruction: do nothing
 That is: deassert all nine control signals (set them to 0)
 No registers or memories are written if the control values are all 0

44
Stall the Pipeline

 Stall the pipeline by keeping instructions in the same stage and inserting a NOP instead
Program execution order (in instructions); time (in clock cycles) CC 1 to CC 10
lw $2, 20($1) IM Reg DM Reg

and becomes nop IM Reg DM Reg

and $4, $2, $5 IM Reg DM Reg

or $8, $2, $6 IM Reg DM Reg

add $9, $4, $2 IM Reg DM Reg

45
Handling Stalls
 Hazard detection unit in ID inserts a stall between a load
instruction and its use:
if (ID/EX.MemRead and
((ID/EX.RegisterRt == IF/ID.RegisterRs) or
(ID/EX.RegisterRt == IF/ID.RegisterRt)))
stall the pipeline for one cycle
(ID/EX.MemRead=1 indicates a load instruction)
 How to stall?
 Stall the instructions in IF and ID: do not change the PC or IF/ID
=> the stages re-execute the same instructions
 What to move into EX: insert a NOP by changing the EX, MEM, WB
control fields of the ID/EX pipeline register to 0
 as the control signals propagate, all control signals to EX, MEM, WB are
deasserted and no registers or memories are written
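The load-use check and the resulting bubble can be sketched in Python. The instruction dictionaries and the names `must_stall` and `insert_bubbles` are hypothetical; for a load, the `rt` field is the destination register, as in the condition above.

```python
def must_stall(id_ex, if_id):
    """True when the instruction in ID must wait one cycle for a load in EX."""
    return bool(id_ex["mem_read"] and id_ex["rt"] in (if_id["rs"], if_id["rt"]))

def insert_bubbles(program):
    """Return the instruction names in the order they enter EX, with "nop"
    wherever the hazard detection unit inserts a bubble."""
    out = []
    for prev, cur in zip([None] + program, program):
        if prev is not None and must_stall(prev, cur):
            out.append("nop")
        out.append(cur["name"])
    return out

if __name__ == "__main__":
    program = [
        {"name": "lw $2, 20($1)",  "rs": 1, "rt": 2, "mem_read": True},
        {"name": "and $4, $2, $5", "rs": 2, "rt": 5, "mem_read": False},
        {"name": "or $8, $2, $6",  "rs": 2, "rt": 6, "mem_read": False},
    ]
    print(insert_bubbles(program))
```

Only the instruction immediately after the load is delayed; the or is two instructions behind the lw and can receive the loaded value through forwarding from MEM/WB.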
46
Branch Hazards
 When the branch decision is made, other instructions are already in the pipeline!
Program execution order (in instructions)

40 beq $1, $3, 7 IM Reg DM Reg

44 and $12, $2, $5 IM Reg DM Reg

48 or $13, $6, $2 IM Reg DM Reg

52 add $14, $2, $2 IM Reg DM Reg

72 lw $4, 50($7) IM Reg DM Reg

Remember: the branch is taken in the MEM stage

47
Solving Control Hazards
 Note: There is no scheme for control hazards as effective as
forwarding is for data hazards, so a simple scheme is used.
 Assume Branch Not Taken
 Continue fetching instructions down the sequential
instruction stream
 When does this strategy pay off?
 If branches are untaken half the time and if it costs little to discard
the instructions
 If the prediction is wrong, discard the fetched instructions
 Discarding instructions
 Flush the instructions in the IF, ID and EX stages

48
Solving Control Hazards
 Reducing the delays of branches
 Concept
 Move the branch execution earlier in the pipeline, so that fewer
instructions need to be flushed.
 What did the MIPS designers do?
 Make the common case fast
 Many branches rely only on simple tests (equality or sign)
 Such tests can be done with a few gates, without the full ALU

49
Solving Control Hazards
 Move the branch execution from MEM to the ID stage
 2 steps in the ID stage
 1. compute the branch target (PC + offset)
 Move the branch adder from the EXE stage to the ID stage
 2. evaluate the branch decision
 Equality test of two registers
 XOR their respective bits and OR all the results
 Flush instructions in the IF stage
 New control line: IF.Flush
 Zero the instruction field of the IF/ID pipeline registers (=>NOP)
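The two ID-stage steps can be mimicked in Python. This is a sketch under the usual MIPS assumption that the branch target is (PC + 4) plus the word offset shifted left by 2; the function names are illustrative.

```python
def registers_equal(a, b, width=32):
    """Equality test built the way the slide describes: XOR the bits pairwise,
    then OR all the results; the registers are equal when the OR is 0."""
    mask = (1 << width) - 1
    diff = (a ^ b) & mask               # a 1 marks every bit position that differs
    any_diff = 0
    for bit in range(width):
        any_diff |= (diff >> bit) & 1   # OR-reduce the XOR outputs
    return any_diff == 0

def next_pc(pc, rs_val, rt_val, offset_words):
    """beq resolved in ID: return the branch target if the registers match,
    otherwise the sequential address PC + 4."""
    pc4 = pc + 4
    if registers_equal(rs_val, rt_val):
        return pc4 + (offset_words << 2)
    return pc4

if __name__ == "__main__":
    print(next_pc(40, 5, 5, 7))   # beq at address 40 with offset 7, registers equal
```

With the beq at address 40 and offset 7 from the branch hazard example, a taken branch goes to 72 and an untaken one to 44, so only the instruction fetched at 44 has to be flushed.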

50
Dynamic Branch Prediction
 Performance = ƒ(accuracy, cost of misprediction)
 Branch prediction buffer or branch history table
 A small memory indexed by the lower portion of the address of
the branch instruction. The memory contains a bit that says whether
the branch was recently taken or not.
 No address check
 Simplest one: 1-bit prediction

51
Basic Branch Prediction Buffers
Branch History Table (BHT) - Small direct-mapped cache of T/NT bits
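[Figure: the low-order bits of the branch instruction's PC index the BHT. If the entry is T (predict taken), fetch continues at the branch target, computed from the PC and the offset in the IR; if it is NT (predict not taken), fetch continues at PC + 4]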

52
Branch Prediction Buffer
 A branch prediction buffer can be implemented as a small
cache accessed during the IF stage.

53
Dynamic Branch Prediction
 2-bit Prediction Scheme
 A prediction must be wrong twice before it is changed
 Suitable for branches that strongly favor taken or not taken
 Such a branch is mispredicted only once

 Advanced ones
 Correlating predictors
 Global and local branch
 Tournament predictor
 Multiple predictions for each branch
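The difference between the 1-bit and 2-bit schemes shows up on a loop branch. The Python sketch below counts mispredictions for a single branch; it models the 2-bit scheme as a saturating counter, which is one common realization and an assumption here, and the function names are illustrative.

```python
def mispredictions_1bit(outcomes, last_taken=True):
    """1-bit scheme: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses

def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken.
    The prediction changes only after two wrong guesses in a row."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

if __name__ == "__main__":
    # A loop branch: taken 9 times, then not taken on exit; the loop runs 3 times.
    loop = ([True] * 9 + [False]) * 3
    print(mispredictions_1bit(loop), mispredictions_2bit(loop))
```

On this pattern the 1-bit predictor misses on every loop exit and again on the first iteration of the next run, while the 2-bit predictor misses only on the exits.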

54
Scheduling the Branch Delay Slot
a. From before
add $s1, $s2, $s3
beq $1, $3, 7
Delay slot
Becomes
beq $1, $3, 7
add $s1, $s2, $s3

b. From target
sub $t4, $t5, $t6
‧‧‧
add $s1, $s2, $s3
beq $1, $3, 7
Delay slot
Becomes
add $s1, $s2, $s3
beq $1, $3, 7
sub $t4, $t5, $t6

c. From fall through
add $s1, $s2, $s3
beq $1, $3, 7
Delay slot
sub $t4, $t5, $t6
Becomes
add $s1, $s2, $s3
beq $1, $3, 7
sub $t4, $t5, $t6

 A is the best. Use B or C when A is impossible (data dependency).

 B is preferred when the branch is taken with high probability, such as a loop branch
 C is scheduled from the not-taken fall-through path.
 For B and C, it must be OK to execute the delay slot instruction even when the branch goes the other way.
55
How it Works?
[Figure: the final pipelined datapath and control, with the hazard detection unit, the forwarding unit, the IF.Flush, ID.Flush, and EX.Flush control lines, the Cause and EPC registers, and the constant 00000100 as an additional input to the PC MUX]

56
Instruction Level Parallelism
 Reference for more details
 Computer architecture: A quantitative approach
 Two methods to increase ILP
 Increase the pipeline depth
 More operations being overlapped
 Pipeline speedup ∝ pipeline depth
 8 or more pipeline stages
 To get the speedup, rebalance the remaining steps
 Performance is potentially greater due to the shorter clock cycle
 Multiple issue
 Issue 3 to 8 instructions in every clock cycle
 Static multiple issue: determined at compile time
 Dynamic multiple issue: determined during execution
 Two problems
 How to package instruction into issue slots (by compiler or hardware)
 Dealing with data and control hazards (by compiler or hardware)
57
Speculation: Find and Exploit more ILP
 Speculation
 An approach that allows the compiler or the processor to “guess”
the outcome of an instruction to remove it as a dependence in
executing other instructions
 E.g. branch, store before load
 How it works
 Compiler or processor use speculation to
 reorder instructions,
 move an instruction across a branch or
 a load across a store
 Mechanism
 A method to check whether the guess was right and a method to back out the
effects
 Difficulty: what if the guess is wrong (back-out capability)
58
Speculation: Find and Exploit more ILP

 Recovery mechanism for incorrect speculation

 Software approach
 Compiler inserts additional instructions to
 Check the accuracy of the speculation
 Provide a fix-up routine

 Hardware approach
 Buffer the speculative results until no longer speculative
 If correct, complete the instruction (write results to registers)
 If incorrect, flush the buffer and re-execute the correct one

59
Speculation: Find and Exploit more ILP

 Other possible problem: Exception in speculative instruction

 Speculating on certain instructions may introduce
exceptions that were formerly not present
 E.g. if a “load” is executed speculatively and its address is illegal,
an exception occurs that should not have happened. (The exception
should occur only when the load is no longer speculative.)
 Compiler-based speculation
 Allow such exceptions to be ignored until it is clear that they should occur
 Hardware-based speculation
 Buffer such exceptions until the instruction is no longer speculative, then raise the
exception

60
Static Multiple Issue
 The compiler assists with packaging instructions and handling data
hazards
 Issue packet
 As one large instruction with multiple operations
 VLIW: very long instruction word
 EPIC: Explicitly Parallel Instruction Computer (IA-64)
 Variation: how the compiler handles hazards
 1. The compiler handles all hazards, schedules code, and inserts code
 2. The compiler handles all dependences within an issue packet, and
hardware detects data hazards and generates stalls between
two issue packets

61
Two-Issue MIPS Processor
 Static two-issue pipeline (64-bit IF and ID)

Instruction type   Pipe stages
ALU or branch      IF  ID  EX  MEM  WB
Load or store      IF  ID  EX  MEM  WB
ALU or branch          IF  ID  EX  MEM  WB
Load or store          IF  ID  EX  MEM  WB

 Extra hardware
 Register file
 2 reads for the ALU, 2 reads for the store, one write for the ALU, one write for the load
 Separate adder for address calculation of data transfers
 Performance
 Improves by up to a factor of 2 (upper bound)
 In reality, it depends on how the instructions are scheduled.
The compiler takes on this role.

62
Multiple Issue Code Scheduling
Original: add scalar $s2 to array
Loop:
lw $t0, 0($s1) # $t0 = array element
addu $t0, $t0, $s2 # add scalar in $s2
sw $t0, 0($s1) # store result
addi $s1, $s1, -4 # decrement pointer
bne $s1, $zero, Loop # branch if $s1 != 0

 Note: The result of a load cannot be used on the next clock cycle because of the load-use dependency.
Scheduled code for two-issue MIPS (4 cycles)
       ALU or branch inst.      Data transfer inst.   Clock cycle
Loop:                           lw $t0, 0($s1)        1
       addi $s1, $s1, -4                              2
       addu $t0, $t0, $s2                             3
       bne $s1, $zero, Loop     sw $t0, 4($s1)        4

63
Loop Unrolling for 2-Issue MIPS
 To get more performance from loops: loop unrolling
 Assume the loop index is a multiple of four
 Unroll the loop four times; use register renaming to remove antidependences
       ALU or branch inst.      Data transfer inst.   Clock cycle
Loop:  addi $s1, $s1, -16       lw $t0, 0($s1)        1
                                lw $t1, 12($s1)       2
       addu $t0, $t0, $s2       lw $t2, 8($s1)        3
       addu $t1, $t1, $s2       lw $t3, 4($s1)        4
       addu $t2, $t2, $s2       sw $t0, 16($s1)       5
       addu $t3, $t3, $s2       sw $t1, 12($s1)       6
                                sw $t2, 8($s1)        7
       bne $s1, $zero, Loop     sw $t3, 4($s1)        8

Before unrolling:
for (i = 0; i < 16; i++) {
    array[i] = array[i] + scalar;
}
After unrolling four times:
for (i = 0; i < 16; i += 4) {
    array[i] = array[i] + scalar;
    array[i+1] = array[i+1] + scalar;
    array[i+2] = array[i+2] + scalar;
    array[i+3] = array[i+3] + scalar;
}
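The gain from unrolling can be quantified by counting instructions and cycles in the two schedules: 5 instructions in 4 cycles for the scheduled loop, and 14 instructions in 8 cycles for the unrolled loop (one row of the schedule per cycle). The Python sketch below does the arithmetic; the function names are illustrative.

```python
def ipc(instructions, cycles):
    """Instructions completed per clock cycle for one loop iteration."""
    return instructions / cycles

def cycles_per_element(cycles_per_iteration, elements_per_iteration):
    """Clock cycles spent per array element."""
    return cycles_per_iteration / elements_per_iteration

if __name__ == "__main__":
    # Scheduled loop: lw, addi, addu, bne, sw in 4 cycles, 1 element per iteration
    print(ipc(5, 4), cycles_per_element(4, 1))
    # Unrolled loop: 4 lw, 4 addu, 4 sw, addi, bne in 8 cycles, 4 elements per iteration
    print(ipc(14, 8), cycles_per_element(8, 4))
```

The unrolled loop reaches 1.75 instructions per cycle out of a possible 2, against 1.25 for the scheduled loop, and halves the cycles per element from 4 to 2.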
64
Dynamic Multiple Issue Processors
 Superscalar
 Instructions issue in order
 0, 1 or more instructions can issue in a given clock cycle
 To achieve good performance
 Needs compiler to schedule instructions
 More important: hardware guarantees instructions are executed
correctly whether scheduled or not
 Extension: dynamic pipeline scheduling
 Hardware support to reorder the execution order to avoid stalls

65
Dynamic Pipeline Scheduling
[Figure: the instruction fetch and decode unit issues in order to a set of reservation stations, one per functional unit. The functional units (integer, integer, ..., floating point, load/store) execute out of order. The commit unit commits results in order]

66
Summary
 Performance is specific to a particular program/s
 Total execution time is a consistent summary of performance
 For a given architecture performance increases come from:
 increases in clock rate (without adverse CPI effects)
 improvements in processor organization that lower CPI
 compiler enhancements that lower CPI and/or instruction count
 Algorithm/Language choices that affect instruction count
 Amdahl’s law
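The summary points can be tied together with two standard formulas: execution time = instruction count × CPI / clock rate, and Amdahl's law. The Python sketch below states them as functions; the parameter names and the example numbers are made up for illustration.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """Total execution time in seconds."""
    return instruction_count * cpi / clock_rate_hz

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the execution time is improved."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

if __name__ == "__main__":
    # Hypothetical program: 1 million instructions, CPI of 2, 1 GHz clock
    print(execution_time(1_000_000, 2.0, 1e9))
    # Doubling the speed of half of the execution time
    print(amdahl_speedup(0.5, 2.0))
```

Lowering CPI or the instruction count shortens the execution time directly, while Amdahl's law bounds the overall gain by the fraction of time the improvement applies to.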

68
See You Next Class!
