EC Chapter2 2014
Chapter 2. Enhancing performance with pipelining 1 Dept. of Computer Architecture, UMA, Oct 2014
Introduction
CPU performance factors
Instruction count
- Determined by ISA and compiler
CPI and Cycle time
- Determined by CPU hardware
Smaller is faster
limited instruction set
limited number of registers in register file
limited number of addressing modes
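These factors combine into the classic performance equation, CPU time = instruction count × CPI × cycle time. A minimal sketch (the numbers below are illustrative, not from the chapter):

```python
# CPU time = instruction count (ISA/compiler) x CPI x clock cycle time (hardware)
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """Return execution time in nanoseconds."""
    return instruction_count * cpi * cycle_time_ns

# one million instructions at CPI = 1 with an 800 ps (0.8 ns) clock
assert cpu_time_ns(1_000_000, 1, 0.8) == 800_000.0
```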
Generic implementation
use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC: PC = PC+4)
decode the instruction (and read registers)
execute the instruction
[Figure: fetch → decode → execute cycle, and the basic datapath with PC, Instruction Memory, Control Unit, and Register File (read/write ports)]
Executing R Format Operations
R format operations (add, sub, slt, and, or)
R-type: op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
perform operation (op and funct) on values in rs and rt
store the result back into the Register File (into location rd)
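The field boundaries can be checked with a small bit-extraction helper; `decode_r_type` is an illustrative sketch, not part of the datapath:

```python
def decode_r_type(instr):
    """Split a 32-bit instruction word into R-type fields (bit ranges as above)."""
    return {
        "op":    (instr >> 26) & 0x3F,   # bits 31-26
        "rs":    (instr >> 21) & 0x1F,   # bits 25-21
        "rt":    (instr >> 16) & 0x1F,   # bits 20-16
        "rd":    (instr >> 11) & 0x1F,   # bits 15-11
        "shamt": (instr >> 6)  & 0x1F,   # bits 10-6
        "funct": instr & 0x3F,           # bits 5-0
    }

# add $3, $1, $2 encodes as op=0, rs=1, rt=2, rd=3, shamt=0, funct=0x20
fields = decode_r_type(0x00221820)
```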
[Figure: datapath for R-type execution — the Register File read ports feed the ALU (overflow and zero outputs); the ALU result returns to the Write Data port]
Note that the Register File is not written on every cycle (e.g. sw), so we need an explicit write control signal for the Register File.
Executing Load and Store Operations
Load and store operations involve:
computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
store: the value read from the Register File during decode is written to the Data Memory
load: the value read from the Data Memory is written to the Register File
[Figure: load/store datapath — the ALU adds the base register to the sign-extended (16 → 32 bit) offset; control signals RegWrite, ALU control, MemWrite, MemRead]
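The effective-address computation described above can be sketched as follows (an illustrative helper, not the hardware):

```python
def mem_address(base, offset16):
    """Effective address = base register + 16-bit sign-extended offset."""
    if offset16 & 0x8000:              # sign-extend the 16-bit field
        offset16 -= 0x10000
    return (base + offset16) & 0xFFFFFFFF

# lw $t0, 4($t1) with $t1 = 0x1000 reads address 0x1004;
# a negative offset (0xFFFC, i.e. -4) reads 0x0FFC
```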
Executing Branch Operations
Branch operations involve
compare the operands read from the Register File during decode
for equality (zero ALU output)
compute the branch target address by adding the updated PC to
the 16-bit signed-extended offset field in the instr
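The target computation above can be sketched as (an illustrative helper, not the hardware):

```python
def branch_target(pc, offset16):
    """Branch target = (PC + 4) + sign-extended 16-bit offset shifted left 2."""
    if offset16 & 0x8000:              # sign-extend the offset field
        offset16 -= 0x10000
    return (pc + 4 + (offset16 << 2)) & 0xFFFFFFFF

# offset 3 branches 12 bytes past PC+4; offset 0xFFFF (-1) branches back to PC
```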
[Figure: branch datapath — one adder computes PC+4; a second adds the sign-extended (16 → 32 bit) offset, shifted left 2, to produce the branch target address]
Executing Jump Operations
The jump operation involves
replace the lower 28 bits of the PC with the lower 26 bits of the
fetched instruction shifted left by 2 bits
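The new PC can be sketched as a concatenation (an illustrative helper, not the hardware):

```python
def jump_target(pc_plus4, target26):
    """New PC = upper 4 bits of PC+4 concatenated with the 26-bit target << 2."""
    return (pc_plus4 & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)

# the upper 4 PC bits are kept; the 26-bit field supplies the lower 28 bits
```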
[Figure: jump datapath — the instruction's lower 26 bits, shifted left 2 to form 28 bits, replace the lower 28 bits of the PC]
Creating a Single Datapath from the Parts
Assemble the datapath segments and add control lines
and multiplexors as needed
Single cycle design – fetch, decode and execute each
instruction in one clock cycle
no datapath resource can be used more than once per
instruction, so some must be duplicated (e.g., separate
Instruction Memory and Data Memory, several adders)
multiplexors needed at the input of shared elements with
control lines to do the selection
write signals to control writing to the Register File and Data
Memory
Fetch, R, and Memory Access Portions
[Figure: combined fetch, R-type, and memory-access datapath — Instruction Memory, Register File, ALU, Data Memory, and sign-extend unit, with control signals RegWrite, ALUSrc, ALU control, MemWrite, MemtoReg, MemRead]
Adding the Control
Selecting the operations to perform (ALU, Register File
and Memory read/write)
Controlling the flow of data (multiplexor inputs)
Instruction formats:
R-type: op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
I-type: op (31-26) | rs (25-21) | rt (20-16) | address offset (15-0)
J-type: op (31-26) | target address (25-0)
Observations
op field always in bits 31-26
addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw, rs is the base register
addr of register to be written is in one of two places – in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
offset for beq, lw, and sw always in bits 15-0
Single Cycle Datapath with Control Unit
[Figure: single-cycle datapath with Control Unit — Instr[31-26] drives the Control Unit outputs ALUOp, Branch, MemRead, MemtoReg, MemWrite, ALUSrc, RegWrite, RegDst, and PCSrc; Instr[25-21], Instr[20-16], and Instr[15-11] select registers; Instr[15-0] feeds the sign extend; Instr[5-0] feeds the ALU control]
R-type Instruction Data/Control Flow
[Figure: the same single-cycle datapath with the paths active for an R-type instruction highlighted]
Load Word Instruction Data/Control Flow
[Figure: the same datapath with the paths active for a load word (lw) highlighted]
Branch Instruction Data/Control Flow
[Figure: the same datapath with the paths active for a branch (beq) highlighted]
Adding the Jump Operation
[Figure: datapath extended for jumps — Instr[25-0] shifted left 2 (26 → 28 bits) is concatenated with PC+4[31-28] and selected by a Jump mux ahead of the PCSrc mux]
Instruction Times (Critical Paths)
What is the clock cycle time assuming negligible
delays for muxes, control unit, sign extend, PC access,
shift left 2, wires, setup and hold times except:
Instruction and Data Memory (200 ps)
ALU and adders (200 ps)
Register File access (reads or writes) (100 ps)
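Summing the stated delays over each instruction class (the stage membership per class is the standard breakdown, assumed here) shows that lw sets the cycle time at 800 ps:

```python
# Illustrative critical-path sums for the single-cycle datapath,
# using the delays stated above (memory 200 ps, ALU 200 ps, regfile 100 ps).
DELAYS = {"IM": 200, "RF_read": 100, "ALU": 200, "DM": 200, "RF_write": 100}

paths = {
    "R-type": ["IM", "RF_read", "ALU", "RF_write"],
    "lw":     ["IM", "RF_read", "ALU", "DM", "RF_write"],
    "sw":     ["IM", "RF_read", "ALU", "DM"],
    "beq":    ["IM", "RF_read", "ALU"],
}
times = {instr: sum(DELAYS[s] for s in p) for instr, p in paths.items()}
# lw has the longest path (800 ps), so it dictates the single-cycle clock
```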
[Figure: single-cycle clocking — lw uses the full cycle; sw finishes early, wasting the remainder]
Pipelining Analogy
Four loads: speedup = 8/3.5 = 2.3
Non-stop: speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
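The laundry-analogy speedup can be checked numerically; a sketch assuming each load takes 2 hours sequentially and n pipelined loads take 0.5n + 1.5 hours:

```python
def speedup(n_loads):
    """Sequential time over pipelined time for the laundry analogy."""
    sequential = 2.0 * n_loads
    pipelined = 0.5 * n_loads + 1.5
    return sequential / pipelined

assert abs(speedup(4) - 8 / 3.5) < 1e-9   # four loads: about 2.3
# as n grows, speedup approaches the number of stages (4)
```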
The Five Stages of Load Instruction
[Figure: the five stages of a load instruction spread across Cycles 1–5]
A Pipelined MIPS Processor
Start the next instruction before the current one has
completed
improves throughput - total amount of work done in a given time
instruction latency (execution time, delay time, response time -
time from the start of an instruction to its completion) is not
reduced
[Figure: overlapped instructions across Cycles 1–8]
Pipeline Performance
Single Cycle Implementation (CC = 800 ps):
[Figure: lw and sw under the 800 ps single-cycle clock, showing the wasted time in the sw cycle]
Pipelining the MIPS ISA
What makes it easy
all instructions are the same length (32 bits)
- can fetch in the 1st stage and decode in the 2nd stage
few instruction formats (three) with symmetry across formats
- can begin reading register file in 2nd stage
memory operations occur only in loads and stores
- can use the execute stage to calculate memory addresses
each instruction writes at most one result (i.e., changes the
machine state) and does it in the last few pipeline stages (MEM
or WB)
operands must be aligned in memory so a single data transfer
takes only one data memory access
MIPS Pipeline Datapath Additions/Mods
State registers between each pipeline stage to isolate them
Stages: IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess), WB (WriteBack)
[Figure: pipelined datapath with state registers (IF/ID, ID/EX, EX/MEM, MEM/WB) between stages; all elements clocked by the System Clock]
MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode
and held in the state registers between pipeline stages
[Figure: pipelined datapath with control — signals generated in ID are carried forward in the ID/EX, EX/MEM, and MEM/WB registers: ALUSrc, ALUOp, RegDst for EX; Branch, MemRead, MemWrite, PCSrc for MEM; RegWrite, MemtoReg for WB]
Pipeline Control
IF Stage: read Instr Memory (always asserted) and write
PC (on System Clock)
ID Stage: no optional control signals to set
Graphically Representing MIPS Pipeline
[Figure: pipeline stage symbols — IM, Reg, ALU, DM, Reg]
Why Pipeline? For Performance!
Time (clock cycles): [Figure: Inst 0 through Inst 4 overlapped in the pipeline]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
structural hazards: attempt to use the same resource by two
different instructions at the same time
data hazards: attempt to use data before it is ready
- An instruction’s source operand(s) are produced by a prior
instruction still in the pipeline
control hazards: attempt to make a decision about program
control flow before the condition has been evaluated and the
new PC target address calculated
- branch and jump instructions, exceptions
A Single Memory Would Be a Structural Hazard
Time (clock cycles): [Figure: lw reading data memory in the same cycle that a later instruction reads its instruction from memory — a conflict if a single memory is shared]
Fix with separate instr and data memories (I$ and D$)
How About Register File Access?
Time (clock cycles): Fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half.
[Figure: sequences such as add $1,…; sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 — a register written and read in the same cycle]
In MIPS pipeline
Need to compare registers and compute target early in the
pipeline
Add hardware to do it in ID stage
Control Hazards
Dependencies backward in time cause control
hazards in branch instructions
[Figure: beq followed by lw, Inst 3, and Inst 4 — instructions fetched before the branch outcome is known]
Other Pipeline Structures Are Possible
What about the (slow) multiply operation?
Make the clock twice as slow or …
let it take two cycles (since it doesn’t use the DM stage)
[Figure: pipeline where MUL takes two cycles, using the slot of the unused DM stage]
Other Sample Pipeline Alternatives
ARM7 (3 stages): IM → Reg → EX
(IM: PC update, start IM access, IM access; Reg: decode, reg 1/reg 2 access; EX: ALU op, DM access, reg write)
XScale: IM1 → IM2 → Reg → SHFT → ALU → DM1 → DM2 → Reg
(stage annotations from the figure: BTB access, start IM access, IM access; decode, reg access; shift/rotate; ALU op; start DM access; DM write, exception; reg write)
Summary
All modern day processors use pipelining
Pipelining doesn’t help latency of single task, it helps
throughput of entire workload
Potential speedup: a CPI of 1 and a fast CC
Pipeline rate limited by slowest pipeline stage
Unbalanced pipe stages make for inefficiencies
The time to “fill” pipeline and time to “drain” it can impact
speedup for deep pipelines and short code runs
Must detect and resolve hazards
Stalling negatively affects CPI (makes CPI greater than the ideal
of 1)
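The stall penalty can be expressed as added cycles per instruction; a sketch with illustrative numbers:

```python
def effective_cpi(ideal_cpi, total_stall_cycles, instruction_count):
    """Ideal CPI plus the average number of stall cycles per instruction."""
    return ideal_cpi + total_stall_cycles / instruction_count

# 50 stall cycles over 100 instructions pushes CPI from 1.0 up to 1.5
assert effective_cpi(1.0, 50, 100) == 1.5
```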
Review: Data Hazards
Read before write data hazard
[Figure: add $1,… followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 — later instructions read $1 before it is written back]
One Way to “Fix” a Data Hazard: Detention
Fix the data hazard by waiting – stall – but it impacts CPI.
[Figure: add $1,… followed by two stall cycles before sub $4,$1,$5 and and $6,$1,$7]
Data Hazards: Detention
Another Way to “Fix” a Data Hazard: Forwarding
Fix data hazards by forwarding results as soon as they are available to where they are needed.
[Figure: add $1,… forwarding $1 to sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5]
Data Hazards: Forwarding (aka Bypassing)
Data Hazards: Forwarding (aka Bypassing)
Take the result from the earliest point that it exists in any
of the pipeline state registers and forward it to the
functional units (e.g., the ALU) that need it that cycle
For ALU functional unit: the inputs can come from any
pipeline register rather than just from ID/EX by
adding multiplexors to the inputs of the ALU
connecting the Rd write data in EX/MEM or MEM/WB to either (or
both) of the EX’s stage Rs and Rt ALU mux inputs
adding the proper control hardware to control the new muxes
Other functional units may need similar forwarding logic
(e.g., the DM)
With forwarding can achieve a CPI of 1 even in the
presence of data dependencies
Forwarding Illustration
[Figure: add $1,… forwards $1 from EX/MEM to sub $4,$1,$5 and from MEM/WB to and $6,$7,$1]
Data Forwarding Control Conditions
1. EX Forward Unit – forwards the result from the previous instr. to either input of the ALU:

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 10

[Figure: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4 — each result forwarded to the next instruction]
Corrected Data Forwarding Control Conditions
1. EX Forward Unit – forwards the result from the previous instr. to either input of the ALU:

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 10

2. MEM Forward Unit – forwards the result from the previous or second previous instr. to either input of the ALU:

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01
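These conditions can be sketched in software; the dict-based pipeline-register fields below are illustrative, not the hardware interface:

```python
def forward_select(ex_mem, mem_wb, id_ex, src):
    """Return the ALU-input mux select for source register id_ex[src]:
    0b10 = forward from EX/MEM (previous instr.),
    0b01 = forward from MEM/WB (second previous), 0b00 = no forwarding."""
    if (ex_mem["RegWrite"] and ex_mem["Rd"] != 0
            and ex_mem["Rd"] == id_ex[src]):
        return 0b10                      # EX forward wins (most recent value)
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and ex_mem["Rd"] != id_ex[src]
            and mem_wb["Rd"] == id_ex[src]):
        return 0b01                      # MEM forward
    return 0b00                          # use the value read in ID
```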
Datapath with Forwarding Hardware
[Figure: pipelined datapath with a Forward Unit — inputs ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, MEM/WB.RegisterRd; outputs drive the new ALU input muxes]
Load-Use Data Hazard
Forwarding with Load-use Data Hazards
[Figure: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 — the loaded value is not available in time to forward to sub]
Forwarding with Load-use Data Hazards
[Figure: the same sequence with one stall inserted between lw $1,4($2) and sub $4,$1,$5, after which forwarding succeeds]
[Figure: datapath with a Hazard Unit driven by ID/EX.MemRead and ID/EX.RegisterRt, using a 0/1 mux on the control signals to insert the bubble]
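The Hazard Unit's load-use test can be sketched as follows (field names are illustrative):

```python
def load_use_hazard(id_ex, if_id):
    """True when the instruction in EX is a load whose destination (Rt)
    is a source of the instruction being decoded — stall one cycle."""
    return (id_ex["MemRead"] and
            id_ex["Rt"] in (if_id["Rs"], if_id["Rt"]))
```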
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next
instruction
C code for A = B + E; C = B + F;
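The effect of reordering can be sketched by counting load-use stalls over an instruction list; the register assignments below are illustrative, not the slide's compiled code:

```python
def count_stalls(instrs):
    """instrs: list of (op, dest_or_None, set_of_source_regs).
    Counts one stall whenever a load's result is used by the next instr."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

# A = B + E; C = B + F in naive order: two load-use stalls
naive = [
    ("lw",  "$t1", set()),            # $t1 = B
    ("lw",  "$t2", set()),            # $t2 = E
    ("add", "$t3", {"$t1", "$t2"}),   # stall: $t2 just loaded
    ("sw",  None,  {"$t3"}),          # A = $t3
    ("lw",  "$t4", set()),            # $t4 = F
    ("add", "$t5", {"$t1", "$t4"}),   # stall: $t4 just loaded
    ("sw",  None,  {"$t5"}),          # C = $t5
]
# hoisting the load of F above the first add removes both stalls
scheduled = [
    ("lw",  "$t1", set()),
    ("lw",  "$t2", set()),
    ("lw",  "$t4", set()),
    ("add", "$t3", {"$t1", "$t2"}),
    ("sw",  None,  {"$t3"}),
    ("add", "$t5", {"$t1", "$t4"}),
    ("sw",  None,  {"$t5"}),
]
assert count_stalls(naive) == 2 and count_stalls(scheduled) == 0
```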
Control Hazards
When the flow of instruction addresses is not sequential
(i.e., PC = PC + 4); incurred by change of flow instructions
Unconditional branches (j, jal, jr)
Conditional branches (beq, bne)
Exceptions
Possible approaches
Stall (impacts CPI)
Move decision point as early in the pipeline as possible, thereby
reducing the number of stall cycles
Delay decision (requires compiler support)
Predict and hope for the best !
[Figure: pipelined datapath with branch and jump support — the jump address is formed from PC+4[31-28] and the shifted instruction field; Branch and PCSrc control the PC mux]
Jumps Incur One Stall
Jumps not decoded until ID, so one flush is needed
To flush, set IF.Flush to zero the instruction field of the
IF/ID pipeline register (turning it into a noop)
Fix the jump hazard by waiting – flush.
[Figure: j followed by one flushed instruction, then the instruction at the jump target]
Supporting ID Stage Jumps
[Figure: datapath with a Jump control signal and an added PC mux selecting the jump address formed in ID]
One Way to “Fix” a Control Hazard: Detention
Fix the branch hazard by waiting – flush – but it affects CPI.
[Figure: beq followed by three flushed instructions before the instruction at the branch target, then Inst 3]
Another Way to “Fix” Control Hazard
Move branch decision hardware back to as early in
the pipeline as possible – i.e., during the decode cycle
Fix the branch hazard by waiting – flush.
[Figure: with the branch decided in ID, beq is followed by only one flushed instruction before the instruction at the branch target, then Inst 3]
Reducing the Delay of Branches
Move the branch decision hardware back to the EX stage
Reduces the number of stall (flush) cycles to two
Adds an and gate and a 2x1 mux to the EX timing path
[Figure: datapath with Compare and IF.Flush hardware, plus the Hazard Unit and Forward Units, supporting the earlier branch decision]
Delayed Branches
If the branch hardware has been moved to the ID stage,
then we can eliminate all branch stalls with delayed
branches which are defined as always executing the next
sequential instruction after the branch instruction – the
branch takes effect after that next instruction
MIPS compiler moves an instruction to immediately after the
branch that is not affected by the branch (a safe instruction)
thereby hiding the branch delay
Scheduling Branch Delay Slots
A. From before branch:
    add $1,$2,$3
    if $2=0 then
        delay slot        → filled with the add $1,$2,$3 from before the branch

B. From branch target:
    sub $4,$5,$6
    …
    add $1,$2,$3
    if $1=0 then
        delay slot        → filled with the sub $4,$5,$6 from the target

C. From fall through:
    add $1,$2,$3
    if $1=0 then
        delay slot        → filled with the fall-through sub $4,$5,$6
    sub $4,$5,$6
Flushing with Misprediction (Not Taken)
[Figure: beq $1,$2,2 at address 4 predicted not taken — the sequentially fetched sub $4,$1,$5 at 8 is flushed, and execution continues with and $6,$1,$7 at 16 and or $8,$1,$9 at 20]
Static Branch Prediction, con’t
Resolve branch hazards by assuming a given outcome
and proceeding
Branch Target Buffer
The BHT predicts when a branch is taken, but does not
tell where it's taken to!
A branch target buffer (BTB) in the IF stage caches the branch
target address, but we also need to fetch the next sequential
instruction. The prediction bit in IF/ID selects which “next”
instruction will be loaded into IF/ID at the next clock edge
- Would need a two read port
instruction memory
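A BTB can be sketched as a small PC-indexed table; this software model (the class name and sizing are illustrative, not the slide's hardware) returns the cached target on a hit and the next sequential address otherwise:

```python
class BTB:
    """Toy branch target buffer: direct-mapped, indexed by the PC."""
    def __init__(self, size=16):
        self.size = size
        self.entries = {}                      # index -> (branch PC, target)

    def lookup(self, pc):
        entry = self.entries.get((pc >> 2) % self.size)
        if entry and entry[0] == pc:
            return entry[1]                    # hit: predicted branch target
        return pc + 4                          # miss: next sequential address

    def update(self, pc, target):
        self.entries[(pc >> 2) % self.size] = (pc, target)
```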
[Figure: BTB in the IF stage, indexed by the PC address, alongside the next sequential instruction address]