CA L7 Unit4 Slides Updated


Computer Architecture

(EE 371 / CE 321 / CS 330)

Dr. Farhan Khan


Assistant Professor,
Dhanani School of Science & Engineering,
Habib University

1
Outline: Unit 4

• Pipelined Processor Implementation

• Pipeline Hazards

• Pipelined Datapath

• Control of Pipelined Datapath

2
Combination of Combinational and Sequential Logic
Elements

• Because only state elements can store a data value, any collection of
combinational logic must have its inputs come from a set of state
elements and its outputs written into a set of state elements

• When two state elements surround a block of combinational logic


– All signals must propagate from state element 1, through the combinational logic, and
to state element 2 in the time of one clock cycle.
– The time necessary for the signals to reach state element 2 defines the length of the
clock cycle

3
Datapath with Control

4
RISC-V Instruction Execution Steps

• RISC-V instructions classically take five steps:


1. IF: Fetch instruction from memory.
2. ID: Decode instruction and read registers
3. EX: Execute the operation or calculate an address.
4. MEM: Access an operand in data memory (if necessary).
5. WB: Write the result back into a register (if necessary).

5
Steps in RISC-V Instruction Execution

6
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

7
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 1

8
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 2

9
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 3

10
Processor Implementation

• Classical performance equation (a quick numerical sketch follows the list below):

CPU Time = Instruction Count × CPI × Clock Cycle Time

• Instruction Count
– Determined by Compiler and Instruction Set Architecture
• CPI and Clock cycle time
– Determined by Processor Implementation
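
A minimal sketch of the equation in plain Python; the numbers are illustrative only, not taken from any particular processor:

# Classical performance equation: CPU time = instruction count x CPI x clock cycle time
def cpu_time_ps(instruction_count, cpi, clock_cycle_ps):
    return instruction_count * cpi * clock_cycle_ps

# Example: 1,000,000 instructions, CPI = 1, 800 ps clock cycle
print(cpu_time_ps(1_000_000, 1, 800))  # 800000000 ps = 0.8 ms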

11
Performance Issues of Single Cycle Processor Design

• CPI is 1.

• However, clock cycle time is determined by the longest possible path/delay in the datapath.
– Critical Path: Load Instruction
– Load instruction uses five functional units in series: instruction memory → register file → ALU → data memory → register file

12
Steps in RISC-V Instruction Execution

200 ps 100 ps 200 ps 200 ps 100 ps

13
RISC-V Performance: Single Cycle

• Assume that the operation times for the major functional units of the datapath are:
– 100 ps for register read or write
– 200 ps for memory access for instructions or data
– 200 ps for ALU operation

Single-Cycle Implementation
• Clock cycle time must support the longest instruction (i.e. ld)
• Clock Cycle Time: 800 ps
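
As a cross-check, the 800 ps clock period is just the sum of the functional-unit delays along the ld critical path. A minimal sketch, using the delays assumed above:

# Functional-unit delays (ps) along the load critical path, as assumed on this slide.
load_critical_path_ps = {
    "instruction memory": 200,
    "register file read": 100,
    "ALU": 200,
    "data memory": 200,
    "register file write": 100,
}
print(sum(load_critical_path_ps.values()))  # 800 ps -> single-cycle clock period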

14
Steps in RISC-V Instruction Execution

200 ps 100 ps 200 ps 200 ps 100 ps

15
Clock Cycle = 800 ps
RISC-V Performance: Single Cycle

Single-cycle

16
Performance Issues of Single Cycle Processor Design

• CPI is 1.

• However, clock cycle time is determined by the longest possible path/delay in the datapath.
– Critical Path: Load Instruction
– Load instruction uses five functional units in series: instruction memory → register file → ALU → data memory → register file

• Although CPI is 1, the overall performance of a single-cycle processor implementation is poor because the clock cycle is too long.

17
Performance Issues of Single Cycle Processor Design

• CPI is 1.

• However, clock cycle time is determined by the longest possible path/delay in the datapath.
– Critical Path: Load Instruction
– Load instruction uses five functional units in series: instruction memory → register file → ALU → data memory → register file

• Although CPI is 1, the overall performance of a single-cycle processor implementation is poor because the clock cycle is too long.

• We can improve performance by Pipelining

18
Pipelining Analogy

19
Pipelining Analogy

• Pipelining improves throughput of our laundry system.


• Pipelining would not decrease the time to complete one load of laundry, but when we have many loads of laundry to do, the improvement in throughput decreases the total time to complete the work.

20
RISC-V Pipeline

• RISC-V instructions classically take five steps:


1. IF: Fetch instruction from memory.
2. ID: Decode instruction and read registers
3. EX: Execute the operation or calculate an address.
4. MEM: Access an operand in data memory (if necessary).
5. WB: Write the result back into a register (if necessary).

• Hence, the first RISC-V pipeline we explore has five stages: one step per
stage

21
Steps in RISC-V Instruction Execution

22
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

23
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 1

24
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 2

25
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 3

26
RISC-V Pipeline Performance

• Assume that the operation times for the major functional units of the datapath are:
– 100 ps for register read or write
– 200 ps for memory access for instructions or data
– 200 ps for ALU operation

27
Steps in RISC-V Instruction Execution

200 ps 100 ps 200 ps 200 ps 100 ps

28
Steps in RISC-V Instruction Execution

200 ps 100 ps 200 ps 200 ps 100 ps

29
Clock Cycle = 800 ps
Steps in RISC-V Instruction Execution: Pipelined

200 ps 100 ps 200 ps 200 ps 100 ps

30
Steps in RISC-V Instruction Execution: Pipelined

200 ps 100 ps 200 ps 200 ps 100 ps

Clock Cycle = 200 ps

31


RISC-V Pipeline Performance

• Assume that the operation times for the major functional units of the datapath are:
– 100 ps for register read or write
– 200 ps for memory access for instructions or data
– 200 ps for ALU operation

Single-Cycle Implementation
• Clock cycle time must support the longest instruction (i.e. ld)
• Clock Cycle Time: 800 ps

Pipelined Implementation
• Each clock must be able to support the slowest stage
• Clock Cycle Time: 200 ps

32
RISC-V Pipeline

Single-cycle

Pipelined

33
RISC-V Pipeline

Single-cycle

Pipelined

Single-Cycle Implementation: Time between instructions = 800 ps
Pipelined Implementation: Time between instructions = 200 ps

34
RISC-V Pipeline

Single-cycle

Pipelined

Single-Cycle Implementation
Total execution time for three instructions = 2400 ps

Pipelined Implementation
Total execution time for three instructions = 1400 ps

Improvement = 2400/1400 ≈ 1.71

35


RISC-V Pipeline

Single-cycle

Pipelined

Single-Cycle Implementation
Total execution time for 1,000,003 instructions
= (1,000,000 × 800) + 2400 ps = 800,002,400 ps

Pipelined Implementation
Total execution time for 1,000,003 instructions
= (1,000,000 × 200) + 1400 ps = 200,001,400 ps

Improvement = 800,002,400 / 200,001,400 ≈ 4.00

36
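
A minimal sketch that reproduces the arithmetic above (five pipeline stages, 800 ps single-cycle clock and 200 ps pipeline clock, as assumed on the earlier slides):

# Compare single-cycle and pipelined execution times for n instructions.
def single_cycle_ps(n, cycle_ps=800):
    return n * cycle_ps

def pipelined_ps(n, cycle_ps=200, stages=5):
    # The first instruction takes 'stages' cycles; each later one finishes one cycle apart.
    return (stages + (n - 1)) * cycle_ps

for n in (3, 1_000_003):
    sc, pl = single_cycle_ps(n), pipelined_ps(n)
    print(n, sc, pl, round(sc / pl, 2))
# 3         -> 2400 ps vs 1400 ps, improvement ~1.71
# 1,000,003 -> 800,002,400 ps vs 200,001,400 ps, improvement ~4.00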
RISC-V Pipeline

Single-cycle

Pipelined

• Pipelining improves performance by increasing instruction throughput, in contrast to decreasing the execution time of an individual instruction.

• Instruction throughput is the important metric because real programs execute millions of instructions.

37
Performance Improvement through Pipelining

• Under ideal conditions and if the pipeline stages are balanced (i.e. all the stages take the same operational time):

Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipeline stages

38
Performance Improvement through Pipelining

• Under ideal conditions and if the pipeline stages are balanced (i.e. all the stages take the same operational time):

Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipeline stages

• Due to unbalanced pipeline stages and pipelining overheads (discussed in the next lectures), this formula only provides an upper limit on the performance improvement (see the sketch below).
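
A small sketch contrasting the upper-limit speedup given by this formula with the speedup actually achieved above (stage times as assumed on the earlier slides):

# Ideal (balanced-stage) speedup vs. the speedup actually observed.
nonpipelined_instr_time_ps = 800        # single-cycle clock
stages = 5
ideal_instr_time_ps = nonpipelined_instr_time_ps / stages   # 160 ps if stages were balanced
actual_instr_time_ps = 200              # limited by the slowest (200 ps) stage

print(nonpipelined_instr_time_ps / ideal_instr_time_ps)     # 5.0 -> upper limit
print(nonpipelined_instr_time_ps / actual_instr_time_ps)    # 4.0 -> achieved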

39
Steps in RISC-V Instruction Execution

200 ps 100 ps 200 ps 200 ps 100 ps

40
Steps in RISC-V Instruction Execution: Unbalanced Pipeline

200 ps 100 ps 200 ps 200 ps 100 ps

41
Steps in RISC-V Instruction Execution

200 ps 200 ps 200 ps 200 ps 200 ps

42
Steps in RISC-V Instruction Execution: Balanced Pipeline

200 ps 200 ps 200 ps 200 ps 200 ps

43
Pipeline Hazards

• These are situations in pipelining when the next instruction cannot execute
in the following clock cycle.

44
Pipeline Hazards

• These are situations in pipelining when the next instruction cannot execute
in the following clock cycle.

• Consider the following instruction sequence:


add x19, x0, x1
sub x2, x19, x3

45
Pipeline Hazards

• These are situations in pipelining when the next instruction cannot execute
in the following clock cycle.

• Consider the following instruction sequence:


add x19, x0, x1
sub x2, x19, x3

46
Pipeline Hazards

• These are situations in pipelining when the next instruction cannot execute
in the following clock cycle.

• There are 3 types of pipeline hazards

1. Structural Hazard
– A required hardware resource is busy

2. Data Hazard
– Data that is needed to execute the next instruction has not yet become available

3. Control Hazard
– Deciding on control action (or instruction sequence) depends on previous instruction that
has not yet been completed

47
Structural Hazards

• A planned instruction cannot execute in the proper clock cycle because


a hardware resource is busy due to a previous instruction

• Example
– In RISC-V pipeline with a single memory, Load/store requires data access to this single
memory

– Instruction fetch (for a future instruction) would have to stall for that cycle

– Hence, pipelined datapaths require separate instruction/data memories or separate instruction/data caches

48
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 3

49
RISC-V Pipeline

Single-cycle

Pipelined

50
Data Hazards

• In a processor pipeline, data hazards arise from the dependence of one


instruction on data that is yet to be produced by an earlier instruction (still
in the pipeline).

51
Data Hazards

• In a processor pipeline, data hazards arise from the dependence of one


instruction on data that is yet to be produced by an earlier instruction (still
in the pipeline).

• Example
– add x19, x0, x1
sub x2, x19, x3

52
Solving Data Hazards: Forwarding (Bypassing)

• Use result when it is computed


– Don’t wait for it to be stored in a register
– Requires extra connections in the datapath

• Example
– add x19, x0, x1
sub x2, x19, x3

53
Solving Data Hazards: Limitations of Forwarding

• Forwarding cannot always avoid pipeline stalls.


– Value might not even be computed when it is needed

• Load-Use Data Hazard


– ld x1, 0(x2)
sub x4, x1, x5

54
Solving Data Hazards: Limitations of Forwarding

• Forwarding cannot always avoid pipeline stalls.


– Value might not even be computed when it is needed

• Load-Use Data Hazard


– ld x1, 0(x2)
sub x4, x1, x5

55
Solving Data Hazards: Code Scheduling

• Reorder code to avoid use of load result in the next instruction


• C code:
– a = b + e; c = b + f;

56
Solving Data Hazards: Code Scheduling

• Reorder code to avoid use of load result in the next instruction


• C code:
– a = b + e; c = b + f;

57
Solving Data Hazards: Code Scheduling

• Reorder code to avoid use of load result in the next instruction


• C code:
– a = b + e; c = b + f;
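
A minimal sketch of why scheduling helps: it counts load-use stalls, assuming full forwarding so a stall occurs only when an instruction reads a register loaded by the immediately preceding ld. The register and offset choices are illustrative, not taken from the slides:

# Instructions are (opcode, destination, sources).
def load_use_stalls(seq):
    stalls = 0
    for prev, curr in zip(seq, seq[1:]):
        if prev[0] == "ld" and prev[1] in curr[2]:
            stalls += 1
    return stalls

unscheduled = [
    ("ld",  "x1", []),            # x1 = b
    ("ld",  "x2", []),            # x2 = e
    ("add", "x3", ["x1", "x2"]),  # a = b + e  (uses x2 right after its load)
    ("sd",  None, ["x3"]),
    ("ld",  "x4", []),            # x4 = f
    ("add", "x5", ["x1", "x4"]),  # c = b + f  (uses x4 right after its load)
    ("sd",  None, ["x5"]),
]
# Move the load of f up so neither add immediately follows the load it depends on.
scheduled = [unscheduled[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(load_use_stalls(unscheduled))  # 2 stalls
print(load_use_stalls(scheduled))    # 0 stalls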

58
Control Hazards

• Deciding on next instruction (to be fetched) in the instruction sequence


depends on previous branch instruction that has not yet been completed

59
Control Hazards

• Deciding on next instruction (to be fetched) in the instruction sequence


depends on previous branch instruction that has not yet been completed

60
Steps in RISC-V Instruction Execution

ld x1, 100(x4)
ld x2, 200(x4)
ld x3, 300(x4)

61
Steps in RISC-V Instruction Execution

Let’s add extra hardware so we can test a register, calculate the branch address, and update the PC during the second stage of the pipeline.

62
Control Hazards

• Deciding on next instruction (to be fetched) in the instruction sequence


depends on previous branch instruction that has not yet been completed

Even in the presence of additional hardware (mentioned in the last slide), there is a pipeline stall.

63
Handling Control Hazards: Branch Prediction

• Predict the outcome of the conditional branch


– When the prediction turns out to be right, the pipeline proceeds at full speed.
– When the prediction turns out to be wrong, the pipeline control must ensure that the
instructions following the wrongly guessed conditional branch have no effect and must
restart the pipeline from the proper branch address

64
Steps in RISC-V Instruction Execution

beq x1,x4, 100


ld x2, 200(x4)
ld x3, 300(x4)

Clock Cycle: 3

65
Handling Control Hazards: Types of Branch Prediction

• Static Branch Prediction:


– Rigid branch prediction rules based on typical branch behavior
Examples
1. Predict that conditional branches will always be untaken
2. At the bottom of loops are conditional branches that branch back to the top of the loop.
Since they are likely to be taken and they branch backward, we could always predict
taken for conditional branches that branch to an earlier address.

• Dynamic Branch Prediction


– Hardware measures actual branch behavior. Then, recent history is used to predict the
outcome of a branch

66
Designing Instruction Set Architecture (ISA) for Pipelining

• The RISC-V ISA has been designed for pipelining, as can be seen from the following design choices:

1) RISC-V instructions are the same length.


– This restriction makes it much easier to fetch instructions in the first pipeline stage and to
decode them in the second stage.

– In an instruction set like the x86, where instructions vary from 1 byte to 15 bytes,
pipelining is considerably more challenging.

– Modern implementations of the x86 architecture actually translate x86 instructions into
simple operations that look like RISC-V instructions and then pipeline the simple
operations rather than the native x86 instructions!

67
Designing Instruction Set Architecture (ISA) for Pipelining

• The RISC-V ISA has been designed for pipelining, as can be seen from the following design choices:

2) RISC-V only has a few instruction formats, with the source and
destination register fields being located in the same place in each
instruction
– This design choice makes it much easier to decode instructions and read registers in one
step.

68
Designing Instruction Set Architecture (ISA) for Pipelining

• The RISC-V ISA has been designed for pipelining, as can be seen from the following design choices:

3) Memory operands only appear in loads or stores in RISC-V.


– This restriction means we can use the execute stage to calculate the memory address and
then access memory in the following stage.

– If we could operate on the operands in memory, as in the x86, stages 3 and 4 would
expand to an address stage, memory stage, and then execute stage.

– We will shortly see the challenges of longer pipelines

69
Pipelining: Overview

• Pipelining improves performance by increasing instruction throughput


– Executes multiple instructions in parallel
– Each instruction has the same latency

• Subject to hazards
– Structure, data, control

• Instruction set design affects complexity of pipeline implementation

70
Pipelining Hazards: Exercise
ld x10, 0(x10)
add x11, x10, x10

71
Pipelining Hazards: Exercise
add x11, x10, x10
addi x12, x10, 5
addi x14, x11, 5

72
Pipeline Diagrams

• Single Clock-Cycle Pipeline Diagrams


– Shows pipeline usage in a single cycle
– Highlights resources used

• Multi Clock-Cycle Pipeline Diagram


– Graph of operation over time

73
Single Clock-Cycle Pipeline Diagram: Example

• Focusing on a single instruction’s passage through pipeline stages

74
Single Clock-Cycle Pipeline Diagram: Example

• State of pipeline in a given clock cycle

75
Multiple Clock-Cycle Pipeline Diagram: Example

• The form showing resource usage

76
Multiple Clock-Cycle Pipeline Diagram: Example

• Traditional form

77
Pipeline Operation for Load Instruction: IF

78
Pipeline Operation for Load Instruction: ID

79
Pipeline Operation for Load Instruction: EX

80
Pipeline Operation for Load Instruction: MEM

81
Pipeline Operation for Load Instruction: WB

82
Flaw in Pipeline Operation for Load Instruction: WB

Wrong
register
number

83
Corrected Pipelined Datapath for Load Instruction

84
Pipeline Operation for Store Instruction: IF

85
Pipeline Operation for Store Instruction: ID

86
Pipeline Operation for Store Instruction: EX

87
Pipeline Operation for Store Instruction: MEM

88
Pipeline Operation for Store Instruction: WB

89
Datapath with Control

90
Implementation of Main Control Unit

• Control signals are derived from binary encoded instructions.

• The Main Control Unit can set all control signals, except PCSrc, based solely on the opcode and funct fields of the instruction.

• To generate the PCSrc signal, the Branch signal from the Main Control Unit is ANDed with the Zero signal from the ALU.

91
Implementation of Main Control Unit

Truth Table
• Control signals are derived from binary encoded instructions.

• The Main Control Unit can set all control signals, except PCSrc, based solely on the opcode field of the instruction.

• To generate the PCSrc signal, the Branch signal from the Main Control Unit is ANDed with the Zero signal from the ALU.
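
A sketch of the main control unit as an opcode-indexed lookup table. The signal values below follow the standard textbook settings for ld, sd, R-type, and beq; treat them as an assumed illustration rather than the exact table on the slide:

# opcode -> (ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp)
MAIN_CONTROL = {
    0b0000011: (1, 1, 1, 1, 0, 0, 0b00),     # ld
    0b0100011: (1, None, 0, 0, 1, 0, 0b00),  # sd (MemtoReg is a don't-care)
    0b0110011: (0, 0, 1, 0, 0, 0, 0b10),     # R-type
    0b1100011: (0, None, 0, 0, 0, 1, 0b01),  # beq (MemtoReg is a don't-care)
}

def main_control(opcode):
    return MAIN_CONTROL[opcode]

# PCSrc is not set by the main control unit: it is Branch ANDed with the ALU Zero flag.
def pcsrc(branch, zero):
    return branch & zero

print(main_control(0b0000011))  # control settings for ld
print(pcsrc(1, 1))              # 1 -> select the branch target address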

92
Pipelined Datapath with Control

93
Grouping of Control Signals by Pipeline Stages: IF & ID

• No control signals are used in the IF and ID stages


94
Grouping of Control Signals by Pipeline Stages: EX

95
Grouping of Control Signals by Pipeline Stages: MEM

96
Grouping of Control Signals by Pipeline Stages: WB

97
Implementation of Main Control Unit

Truth Table
• Control signals are derived from binary encoded instructions.

• Since control lines are used in the EX, MEM, and WB stages, the Main Control Unit can create all the control information during the ID stage, which is then used by the later stages.

• The simplest way of passing these control signals to later stages is to extend the pipeline registers to include control information.
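
A minimal sketch of that idea: the control signals are produced once in ID, grouped by the stage that uses them, and simply copied forward through the pipeline registers (field names are illustrative; the ld values follow the standard textbook settings):

def decode_control_ld():
    # Control bundle for ld, grouped by the stage that consumes each signal.
    return {"EX":  {"ALUSrc": 1, "ALUOp": 0b00},
            "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
            "WB":  {"MemtoReg": 1, "RegWrite": 1}}

ctrl   = decode_control_ld()                       # created during ID
id_ex  = {"EX": ctrl["EX"], "MEM": ctrl["MEM"], "WB": ctrl["WB"]}
ex_mem = {"MEM": id_ex["MEM"], "WB": id_ex["WB"]}  # EX controls consumed, drop them
mem_wb = {"WB": ex_mem["WB"]}                      # MEM controls consumed, drop them
print(mem_wb)  # only the write-back controls reach the final stage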
98
Pipelined Datapath with Control

99
Data Hazards and Forwarding: Implementation Issues

• Detecting the need to forward

• Implementation of forwarding

• Detecting the data hazard that cannot be solved by forwarding

• How to stall the pipeline

100
Data Hazards

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

101
Data Hazards

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

102
Data Hazards

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

103
Detecting the Need to Forward

• Pass source and destination register numbers along pipeline


– e.g., ID/EX.RegisterRs1 = register number for Rs1 sitting in ID/EX pipeline register

• ALU operand register numbers in EX stage are given by


– ID/EX.RegisterRs1, ID/EX.RegisterRs2

• Data hazards when:
– 1a. EX/MEM.RegisterRd == ID/EX.RegisterRs1 (forward from EX/MEM pipeline register)
– 1b. EX/MEM.RegisterRd == ID/EX.RegisterRs2 (forward from EX/MEM pipeline register)
– 2a. MEM/WB.RegisterRd == ID/EX.RegisterRs1 (forward from MEM/WB pipeline register)
– 2b. MEM/WB.RegisterRd == ID/EX.RegisterRs2 (forward from MEM/WB pipeline register)

104
Pipelined Datapath

105
Detecting the Need to Forward

• But only if the forwarding instruction will write to a register!


– EX/MEM.RegWrite, MEM/WB.RegWrite

• And only if Rd for that instruction is not x0


– EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

106
Data Hazards and Forwarding

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

107
Data Hazards and Forwarding

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

108
Forwarding Pathways

109
Forwarding Pathways

110
Forwarding Conditions

• EX Hazard (Forwarding from EX/MEM Pipeline Register)


– if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs1)) ForwardA = 10

– if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs2)) ForwardB = 10
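
A sketch of just this EX-hazard test (the MEM hazard follows on the next slide). Pipeline registers are modeled as plain dicts with illustrative field names:

def ex_hazard_forward(ex_mem, id_ex):
    forward_a = forward_b = 0b00
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex["RegisterRs1"]:
            forward_a = 0b10   # ALU operand A comes from the EX/MEM register
        if ex_mem["RegisterRd"] == id_ex["RegisterRs2"]:
            forward_b = 0b10   # ALU operand B comes from the EX/MEM register
    return forward_a, forward_b

# sub x2,x1,x3 followed by and x12,x2,x5: x2 must be forwarded from EX/MEM.
print(ex_hazard_forward({"RegWrite": 1, "RegisterRd": 2},
                        {"RegisterRs1": 2, "RegisterRs2": 5}))  # (2, 0) i.e. ForwardA=10, ForwardB=00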

111
Forwarding Conditions

• MEM hazard (Forwarding from MEM/WB Pipeline Register)


– if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs1)) ForwardA = 01

– if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs2)) ForwardB = 01

112
Double Data Hazard

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

• Consider the sequence:


add x1,x1,x2
add x1,x1,x3
add x1,x1,x4

113
Data Hazards and Forwarding

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

114
Data Hazards and Forwarding

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

115
Double Data Hazard

• Consider this sequence:


sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)

• Consider the sequence:


add x1,x1,x2
add x1,x1,x3
add x1,x1,x4

• Hazards occur from instructions in both the MEM and WB stages
– We want to use the most recent result, the one in the MEM stage
• Revise the forwarding condition from the MEM/WB pipeline register
– Only forward if the forwarding condition for the EX/MEM register isn’t true

116
Revised Forwarding Conditions

• MEM hazard (Forwarding from MEM/WB Pipeline Register)


– if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs1))
and (MEM/WB.RegisterRd == ID/EX.RegisterRs1)) ForwardA = 01

– if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs2))
and (MEM/WB.RegisterRd == ID/EX.RegisterRs2)) ForwardB = 01
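
Putting both cases together, a sketch of the complete forwarding-unit logic, including the revision above so the more recent EX/MEM result wins over MEM/WB (dict field names are illustrative):

def forwarding_unit(id_ex, ex_mem, mem_wb):
    def ex_hazard(rs):
        return (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
                and ex_mem["RegisterRd"] == rs)

    def mem_hazard(rs):
        return (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
                and not ex_hazard(rs)            # prefer the newer EX/MEM value
                and mem_wb["RegisterRd"] == rs)

    def select(rs):
        if ex_hazard(rs):
            return 0b10
        if mem_hazard(rs):
            return 0b01
        return 0b00

    return select(id_ex["RegisterRs1"]), select(id_ex["RegisterRs2"])

# add x1,x1,x2 ; add x1,x1,x3 ; add x1,x1,x4 (double data hazard):
# the third add must take x1 from EX/MEM (2 = 0b10), not from MEM/WB.
print(forwarding_unit({"RegisterRs1": 1, "RegisterRs2": 4},
                      {"RegWrite": 1, "RegisterRd": 1},
                      {"RegWrite": 1, "RegisterRd": 1}))  # (2, 0)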

117
Pipelined Datapath with Forwarding

118
Solving Data Hazards: Limitations of Forwarding

• Forwarding cannot always avoid pipeline stalls.


– Value might not even be computed when it is needed

• Load-Use Data Hazard


– ld x1, 0(x2)
sub x4, x1, x5

119
Load-Use Data Hazard

120
Load-Use Data Hazard: Need for Hazard Detection Unit

• In addition to a Forwarding Unit, we need a Hazard Detection Unit.


– It operates during the ID stage so that it can insert the stall between the load and the
instruction dependent on it.

121
Load-Use Data Hazard: Need for Hazard Detection Unit

• In addition to a Forwarding Unit, we need a Hazard Detection Unit.


– It operates during the ID stage so that it can insert the stall between the load and the
instruction dependent on it.

• In ID stage, Hazard Detection Unit uses the following condition to test for
the occurrence of load-use data hazard
– ID/EX.MemRead and
((ID/EX.RegisterRd = IF/ID.RegisterRs1) or
(ID/EX.RegisterRd = IF/ID.RegisterRs2))

• If load-use data hazard is detected, Hazard Detection Unit stalls the


pipeline (and inserts a bubble)
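
A sketch of that ID-stage test and the stall response (register fields are modeled as dict entries with illustrative names):

def load_use_hazard(id_ex, if_id):
    return (id_ex["MemRead"]
            and (id_ex["RegisterRd"] == if_id["RegisterRs1"]
                 or id_ex["RegisterRd"] == if_id["RegisterRs2"]))

# ld x1, 0(x2) followed by sub x4, x1, x5: a one-cycle stall is required.
print(load_use_hazard({"MemRead": 1, "RegisterRd": 1},
                      {"RegisterRs1": 1, "RegisterRs2": 5}))  # True

# On a hazard: hold the PC and the IF/ID register, and zero the control
# signals going into ID/EX so a bubble (nop) flows down the pipeline.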

122
How to Stall the Pipeline

• Force control values in ID/EX register to 0


– As a result, EX, MEM and WB stages do nop (no-operation)

• Prevent update of PC and IF/ID register


– The instruction that uses the load result is decoded again
– The following instruction is fetched again

123
Load-Use Data Hazard: Need for Hazard Detection Unit

• In addition to a Forwarding Unit, we need a Hazard Detection Unit.


– It operates during the ID stage so that it can insert the stall between the load and the
instruction dependent on it.

• In ID stage, Hazard Detection Unit uses the following condition to test for
the occurrence of load-use data hazard
– ID/EX.MemRead and
((ID/EX.RegisterRd = IF/ID.RegisterRs1) or
(ID/EX.RegisterRd = IF/ID.RegisterRs2))

• If load-use data hazard is detected, Hazard Detection Unit stalls the
pipeline (and inserts a bubble)
– The hazard detection unit controls the writing of the PC and IF/ID registers plus the
multiplexor that chooses between the real control values and all 0s

124
Load-Use Data Hazard: How to Stall the Pipeline

125
Load-Use Data Hazard: How to Stall the Pipeline

126
Load-Use Data Hazard: How to Stall the Pipeline

127
Load-Use Data Hazard: How to Stall the Pipeline

128
Pipeline Hazards

• These are situations in pipelining when the next instruction cannot execute
in the following clock cycle.

• There are 3 types of pipeline hazards

1. Structural Hazard
– A required hardware resource is busy

2. Data Hazard
– Data that is needed to execute the next instruction has not yet become available

3. Control Hazard
– Deciding on control action (or instruction sequence) depends on previous instruction that
has not yet been completed

129
Control Hazards

• Deciding on next instruction (to be fetched) in the instruction sequence


depends on previous branch instruction that has not yet been completed

130
Control Hazards

• Consider this sequence:


36: sub x10, x4, x8
40: beq x1, x3, 16 // PC-relative branch
// to 40+16*2=72
44: and x12, x2, x5
48: or x13, x2, x6
52: add x14, x4, x2
56: sub x15, x6, x7
...
72: ld x4, 50(x7)

131
Control Hazards

• If branch outcome determined in MEM stage

132
Handling Control Hazards: Reducing Branch Delay

• Move hardware to determine outcome to ID stage


– Target address adder
– Register comparator

133
Control Hazards
• Consider this sequence:
36: sub x10, x4, x8
40: beq x1, x3, 16 // PC-relative branch
// to 40+16*2=72
44: and x12, x2, x5
48: or x13, x2, x6
52: add x14, x4, x2
56: sub x15, x6, x7
...
72: ld x4, 50(x7)

• Simple strategy:
– Let’s always assume that the branch is not taken

134
Example: Wrong Assumption about Branch Outcome

135
Example: Wrong Assumption about Branch Outcome

136
How to Convert an Instruction in Pipeline to NOP(Bubble)

137
How to Convert an Instruction in Pipeline to NOP(Bubble)

• To flush instructions in the IF stage, we add a control line, called IF.Flush, that zeros the instruction field of the IF/ID pipeline register.

138
Control Hazards

• Deciding on next instruction (to be fetched) in the instruction sequence


depends on previous branch instruction that has not yet been completed

Even in the presence of additional hardware (mentioned in the last slide), there is a pipeline stall.

139
Handling Control Hazards: Branch Prediction

• Predict the outcome of the conditional branch


– When the prediction turns out to be right, the pipeline proceeds at full speed.
– When the prediction turns out to be wrong, the pipeline control must ensure that the
instructions following the wrongly guessed conditional branch have no effect and must
restart the pipeline from the proper branch address

140
Handling Control Hazards: Types of Branch Prediction

• Static Branch Prediction:


– Rigid branch prediction rules based on typical branch behavior
Examples
1. Predict that conditional branches will always be untaken
2. At the bottom of loops are conditional branches that branch back to the top of the loop.
Since they are likely to be taken and they branch backward, we could always predict
taken for conditional branches that branch to an earlier address.

• Dynamic Branch Prediction


– Hardware measures actual branch behavior. Then, recent history is used to predict the
outcome of a branch

141
Dynamic Branch Prediction: Implementation

• Branch Prediction Buffer (aka Branch History Table)


– A small memory indexed by the lower portion of the address of the branch instruction.
– Stores information about the recent outcomes (taken/not taken) of the branch
– While executing the branch, look up the address of the instruction in branch history table
to see if the conditional branch was taken/not taken the last time this instruction was
executed, and begin fetching new instructions accordingly.
– If the prediction turns out to be wrong, flush the pipeline and update the prediction
information in the branch history table.

142
Dynamic Branch Prediction: Implementation

• Two Variants of the Branch Prediction Buffer (aka Branch History Table)


– 1-bit Predictor
– 2-bit Predictor

• 1-bit Predictor
– Only 1 bit is used to keep the prediction information.
– At each misprediction, the prediction bit is inverted

• 2-bit Predictor
– 2 bits are used to keep the prediction information
– The 2 bits are used to encode 4 states in the system

143
Dynamic Branch Prediction: Implementation

• Ideally, the accuracy of the predictor would match the taken branch
frequency for highly regular branches
– Consider the inner loop branch that branches nine times in a row, and then is not taken
once

• Shortcoming of One-bit Predictor


– Mispredict as taken on last iteration of inner loop
– Then mispredict as not taken on first iteration of inner loop next time around
– Thus, the prediction accuracy for this branch that is taken 90% of the time is only 80%
(two incorrect predictions and eight correct ones)
– 2-bit prediction schemes try to address this shortcoming
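
A simulation sketch of the two predictors on the loop pattern just described (taken nine times, then not taken, repeated). The 2-bit predictor is modeled as the usual saturating counter; this is an illustration of the behavior, not the exact hardware:

def accuracy(predict, update, state, outcomes):
    correct = 0
    for taken in outcomes:
        correct += (predict(state) == taken)
        state = update(state, taken)
    return correct / len(outcomes)

# 1-bit predictor: remember only the last outcome.
one_bit = (lambda s: s, lambda s, taken: taken)

# 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken.
two_bit = (lambda s: s >= 2,
           lambda s, taken: min(s + 1, 3) if taken else max(s - 1, 0))

loop = ([True] * 9 + [False]) * 10   # branch is taken 90% of the time

print(accuracy(*one_bit, state=True, outcomes=loop))  # 0.81 -> roughly 80% accurate
print(accuracy(*two_bit, state=3, outcomes=loop))     # 0.9  -> about 90% accurate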
144
