EC Chapter2 2014

1) The document describes the basic operation of a MIPS processor implementation, including the fetch, decode, and execute stages. 2) The key steps are fetching instructions from memory and incrementing the program counter, decoding the instruction to read registers and determine the operation, and executing the different instruction types by performing arithmetic/logical operations, memory accesses, or branch comparisons. 3) Explicit control signals are needed to determine when to write the register file or memory, since write-back does not occur on every clock cycle.

Computer Architecture 14-15

Chapter 2: Enhancing performance with pipelining

[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2009, MK]

Chapter 2. Enhancing performance with pipelining 1 Dept. of Computer Architecture, UMA, Oct 2014
Introduction
CPU performance factors
Instruction count
- Determined by ISA and compiler
CPI and Cycle time
- Determined by CPU hardware

We will examine two MIPS implementations
- A simplified version
- A more realistic pipelined version
Simple subset, shows most aspects
- Memory reference: lw, sw
- Arithmetic/logical: add, sub, and, or, slt
- Control transfer: beq, j
Review: MIPS (RISC) Design Principles
Simplicity favors regularity
fixed size instructions
small number of instruction formats
opcode always the first 6 bits

Smaller is faster
limited instruction set
limited number of registers in register file
limited number of addressing modes

Make the common case fast


arithmetic operands from the register file (load-store machine)
allow instructions to contain immediate operands

Good design demands good compromises


three instruction formats
The Processor: Datapath & Control
Our implementation of the MIPS is simplified
memory-reference instructions: lw, sw
arithmetic-logical instructions: add, sub, and, or, slt
control flow instructions: beq, j

Generic implementation (Fetch → Decode → Exec cycle):
use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC: PC = PC+4)
decode the instruction (and read registers)
execute the instruction

All instructions (except j) use the ALU after reading the registers

How? memory-reference? arithmetic? control flow?
Aside: Clocking Methodologies
The clocking methodology defines when data in a state
element is valid and stable relative to the clock
State elements - a memory element such as a register
Edge-triggered – all state changes occur on a clock edge
Typical execution
read contents of state elements -> send values through
combinational logic -> write results to one or more state elements
[Diagram: state element 1 → combinational logic → state element 2, all within one clock cycle]


Assumes state elements are written on every clock
cycle; if not, need explicit write control signal
write occurs only when both the write control is asserted and the
clock edge occurs
Fetching Instructions
Fetching instructions involves
reading the instruction from the Instruction Memory
updating the PC value to be the address of the next
(sequential) instruction

[Diagram: the PC supplies the Read Address of the Instruction Memory; an Add unit computes PC + 4 each clock to update the PC]

PC is updated every clock cycle, so it does not need an explicit write control signal – just a clock signal
Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal
Decoding Instructions
Decoding instructions involves
sending the fetched instruction’s opcode and function field
bits to the control unit

[Diagram: the opcode and funct bits feed the Control Unit; Read Addr 1 and Read Addr 2 from the instruction index the Register File, producing Read Data 1 and Read Data 2]
reading two values from the Register File


- Register File addresses are contained in the instruction

Executing R Format Operations
R format operations (add, sub, slt, and, or)
R-type: | op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0) |
perform operation (op and funct) on values in rs and rt
store the result back into the Register File (into location rd)

[Diagram: Read Data 1 and Read Data 2 feed the ALU (under ALU control, producing overflow and zero outputs); the ALU result is written back through Write Addr/Write Data under the RegWrite signal]
Note that Register File is not written every cycle (e.g. sw), so
we need an explicit write control signal for the Register File
Executing Load and Store Operations
Load and store operations involve
compute memory address by adding the base register (read from
the Register File during decode) to the 16-bit signed-extended
offset field in the instruction
store value (read from the Register File during decode) written to
the Data Memory
load value, read from the Data Memory, written to the Register File
[Diagram: the ALU adds Read Data 1 (the base register) to the sign-extended 16-bit offset to form the Data Memory Address; Read Data 2 drives the memory's Write Data (MemWrite for sw), and the memory's Read Data (MemRead for lw) is written back to the Register File under RegWrite]

Executing Branch Operations
Branch operations involve
compare the operands read from the Register File during decode
for equality (zero ALU output)
compute the branch target address by adding the updated PC to
the 16-bit signed-extended offset field in the instr
[Diagram: the ALU compares Read Data 1 and Read Data 2, sending its zero output to the branch control logic; a separate Add unit computes the branch target address from PC+4 plus the sign-extended offset shifted left 2]
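The branch-target arithmetic above can be sketched in a few lines of Python (the function name and calling convention are ours, not part of the slides):

```python
def branch_target(pc, offset16):
    """Branch target = (PC + 4) + (sign-extended 16-bit offset << 2)."""
    offset16 &= 0xFFFF                 # keep only the 16-bit immediate field
    if offset16 & 0x8000:              # sign-extend negative offsets
        offset16 -= 1 << 16
    return (pc + 4 + (offset16 << 2)) & 0xFFFFFFFF
```

For example, a beq at 0x00400000 with offset +3 targets 0x00400010 (three instructions past PC+4), and an offset field of 0xFFFE (-2) branches backward.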
Executing Jump Operations
Jump operation involves
replace the lower 28 bits of the PC with the lower 26 bits of the
fetched instruction shifted left by 2 bits

[Diagram: the low 26 bits of the fetched instruction are shifted left 2 to give 28 bits, then concatenated with the top 4 bits of PC+4 to form the jump address]
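The jump-address construction described above can be expressed directly with bit operations (function name is ours):

```python
def jump_target(pc_plus4, target26):
    """Replace the low 28 bits of PC+4 with the 26-bit target field shifted left 2."""
    return (pc_plus4 & 0xF0000000) | ((target26 << 2) & 0x0FFFFFFF)
```

Note the top 4 bits of PC+4 survive, so a jump can only reach targets within the current 256 MB region.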
Creating a Single Datapath from the Parts
Assemble the datapath segments and add control lines
and multiplexors as needed
Single cycle design – fetch, decode and execute each instruction in one clock cycle
no datapath resource can be used more than once per
instruction, so some must be duplicated (e.g., separate
Instruction Memory and Data Memory, several adders)
multiplexors needed at the input of shared elements with
control lines to do the selection
write signals to control writing to the Register File and Data
Memory

Cycle time is determined by length of the longest path

Fetch, R, and Memory Access Portions

[Diagram: the combined fetch, R-type, and memory-access datapath – Instruction Memory, Register File, ALU, and Data Memory – with control signals RegWrite, ALUSrc, ALU control, MemWrite, MemtoReg, and MemRead]
Adding the Control
Selecting the operations to perform (ALU, Register File
and Memory read/write)
Controlling the flow of data (multiplexor inputs)
R-type: | op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0) |
I-type: | op (31-26) | rs (25-21) | rt (20-16) | address offset (15-0) |
J-type: | op (31-26) | target address (25-0) |

Observations
- op field always in bits 31-26
- addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw rs is the base register
- addr. of register to be written is in one of two places – in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw always in bits 15-0
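The field positions listed above can be extracted with shifts and masks; a small Python sketch (the function and key names are ours):

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into the fields named above."""
    return {
        "op":     (instr >> 26) & 0x3F,   # bits 31-26
        "rs":     (instr >> 21) & 0x1F,   # bits 25-21
        "rt":     (instr >> 16) & 0x1F,   # bits 20-16
        "rd":     (instr >> 11) & 0x1F,   # bits 15-11 (R-type only)
        "shamt":  (instr >> 6)  & 0x1F,   # bits 10-6  (R-type only)
        "funct":  instr & 0x3F,           # bits 5-0   (R-type only)
        "imm16":  instr & 0xFFFF,         # bits 15-0  (I-type)
        "target": instr & 0x3FFFFFF,      # bits 25-0  (J-type)
    }
```

For example, the word 0x012A4020 encodes add $8, $9, $10: op = 0, rs = 9, rt = 10, rd = 8, funct = 0x20.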
Single Cycle Datapath with Control Unit
[Diagram: the complete single-cycle datapath; the Control Unit decodes Instr[31-26] to produce RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, and ALUOp, while PCSrc (Branch AND zero) selects the next PC; the ALU control decodes Instr[5-0]]
R-type Instruction Data/Control Flow
[Diagram: the same single-cycle datapath, highlighting the paths active for an R-type instruction – instruction fetch, two register reads, the ALU operation, and write-back to rd]
Load Word Instruction Data/Control Flow
[Diagram: the same datapath, highlighting the paths active for lw – base register read, sign-extended offset into the ALU, Data Memory read, and write-back to rt]
Branch Instruction Data/Control Flow
[Diagram: the same datapath, highlighting the paths active for beq – two register reads compared by the ALU, with the zero output and the branch adder selecting the next PC via PCSrc]

Adding the Jump Operation
[Diagram: the datapath extended with a Jump control signal and an extra mux that selects the jump address – Instr[25-0] shifted left 2 and concatenated with PC+4[31-28] – as the next PC]
Instruction Times (Critical Paths)
What is the clock cycle time assuming negligible
delays for muxes, control unit, sign extend, PC access,
shift left 2, wires, setup and hold times except:
Instruction and Data Memory (200 ps)
ALU and adders (200 ps)
Register File access (reads or writes) (100 ps)

Instr. | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total
R-type |       |        |        |       |        |
load   |       |        |        |       |        |
store  |       |        |        |       |        |
beq    |       |        |        |       |        |
jump   |       |        |        |       |        |
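One way to work the table: sum the delays of the stages each instruction actually uses. The totals below match the per-instruction times given in the pipeline-performance table later in the chapter (the variable names are ours):

```python
# Stage delays from the slide, in picoseconds
IMEM, REG, ALU, DMEM = 200, 100, 200, 200

paths = {
    "R-type": IMEM + REG + ALU + REG,          # 600 ps: fetch, read, ALU, write
    "load":   IMEM + REG + ALU + DMEM + REG,   # 800 ps: uses every unit
    "store":  IMEM + REG + ALU + DMEM,         # 700 ps: no register write
    "beq":    IMEM + REG + ALU,                # 500 ps: compare only
    "jump":   IMEM,                            # 200 ps: fetch only
}
cycle_time = max(paths.values())  # single-cycle clock must fit the slowest: lw
```

So the single-cycle clock must be at least 800 ps, set by lw.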
Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently – the clock cycle must
be timed to accommodate the slowest instruction
especially problematic for more complex instructions like
floating point multiply

[Timing diagram: two 800 ps clock cycles; lw uses its whole cycle, while sw finishes early and the remainder of its cycle is waste]

May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
Is simple and easy to understand
How Can We Make It Faster?
Start fetching and executing the next instruction before
the current one has completed
Pipelining – (all?) modern processors are pipelined for
performance
Remember the performance equation:
CPU time = CPI * IC * ClockCycleTime

Under ideal conditions and with a large number of


instructions, the speedup from pipelining is
approximately equal to the number of pipe stages
A five stage pipeline is nearly five times faster because the ClockCycleTime is nearly one-fifth as long

Fetch (and execute) more than one instruction at a time


Superscalar processing – stay tuned

Pipelining Analogy

Pipelined laundry: overlapping execution


Parallelism improves performance

Four loads:
Speedup = 8/3.5 = 2.3
Non-stop:
Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages

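The laundry numbers above can be checked directly (each load takes 2 hours done sequentially; pipelined, a finished load emerges every 0.5 hours once the 4-stage pipeline fills):

```python
n, stages = 4, 4
sequential = 2.0 * n                   # 8 hours for four loads back to back
pipelined = 0.5 * (n + stages - 1)     # 3.5 hours: fill time plus one load/step
speedup = sequential / pipelined       # 8 / 3.5 ≈ 2.3
# As n grows, 2n / (0.5n + 1.5) approaches 4, the number of stages
```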
The Five Stages of Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

lw IFetch Dec Exec Mem WB

IFetch: Instruction Fetch and Update PC


Dec: Registers Fetch and Instruction Decode
Exec: Execute R-type; calculate memory address
Mem: Read/write the data from/to the Data Memory
WB: Write the result data into the register file

A Pipelined MIPS Processor
Start the next instruction before the current one has
completed
improves throughput - total amount of work done in a given time
instruction latency (execution time, delay time, response time -
time from the start of an instruction to its completion) is not
reduced
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8

lw IFetch Dec Exec Mem WB

sw IFetch Dec Exec Mem WB

R-type IFetch Dec Exec Mem WB

- clock cycle (pipeline stage time) is limited by the slowest stage


- for some stages don’t need the whole clock cycle (e.g., WB)
- for some instructions, some stages are wasted cycles (i.e.,
nothing is done during that cycle for that instruction)
Pipeline Performance

Assume time for stages is


100ps for register read or write
200ps for other stages

Compare pipelined datapath with single-cycle


datapath
Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
lw       | 200 ps      | 100 ps        | 200 ps | 200 ps        | 100 ps         | 800 ps
sw       | 200 ps      | 100 ps        | 200 ps | 200 ps        |                | 700 ps
R-format | 200 ps      | 100 ps        | 200 ps |               | 100 ps         | 600 ps
beq      | 200 ps      | 100 ps        | 200 ps |               |                | 500 ps

Pipeline Performance

[Figure: instruction sequence timing – single-cycle (Tc = 800 ps) vs pipelined (Tc = 200 ps)]

Pipeline Performance
Single Cycle Implementation (CC = 800 ps):
[Timing: lw then sw, each taking a full 800 ps cycle, with waste in the sw cycle]
Pipeline Implementation (CC = 200 ps):
[Timing: lw, sw, and an R-type overlapped in IFetch–Dec–Exec–Mem–WB stages, with a new instruction starting every 200 ps]

To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
How long does each take to complete 1,000,000 adds?
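One way to work the 1,000,000-adds question, using the cycle times from this slide (a sketch; we assume no stalls and count 4 fill cycles for the 5-stage pipeline):

```python
n = 1_000_000
single_cycle_ps = n * 800          # every instruction takes the full 800 ps cycle
pipelined_ps = (n + 4) * 200       # 4 extra cycles to fill the 5-stage pipeline
speedup = single_cycle_ps / pipelined_ps
# speedup approaches 800/200 = 4, not 5, because the stages are unbalanced
```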
Pipeline Speedup
If all stages are balanced
i.e., all take the same time

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages

If not balanced, speedup is less


Speedup due to increased throughput
Latency (time for each instruction) does not decrease

Pipelining the MIPS ISA
What makes it easy
all instructions are the same length (32 bits)
- can fetch in the 1st stage and decode in the 2nd stage
few instruction formats (three) with symmetry across formats
- can begin reading register file in 2nd stage
memory operations occur only in loads and stores
- can use the execute stage to calculate memory addresses
each instruction writes at most one result (i.e., changes the
machine state) and does it in the last few pipeline stages (MEM
or WB)
operands must be aligned in memory so a single data transfer
takes only one data memory access

MIPS Pipeline Datapath Additions/Mods
State registers between each pipeline stage to isolate them
[Diagram: the datapath divided into IF:IFetch, ID:Dec, EX:Execute, MEM:MemAccess, and WB:WriteBack stages, with IF/ID, ID/EX, EX/MEM, and MEM/WB state registers between the stages, all clocked by the System Clock]
MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode
and held in the state registers between pipeline stages
[Diagram: the pipelined datapath with control signals generated during Decode and carried forward in the ID/EX, EX/MEM, and MEM/WB registers – RegDst, ALUOp, and ALUSrc for EX; Branch, MemRead, and MemWrite for MEM; RegWrite and MemtoReg for WB; PCSrc comes from the branch condition]
Pipeline Control
IF Stage: read Instr Memory (always asserted) and write
PC (on System Clock)
ID Stage: no optional control signals to set

     | EX Stage                     | MEM Stage              | WB Stage
     | RegDst ALUOp1 ALUOp0 ALUSrc  | Brch MemRead MemWrite  | RegWrite MemtoReg
R    | 1      1      0      0      | 0    0       0         | 1        0
lw   | 0      0      0      1      | 0    1       0         | 1        1
sw   | X      0      0      1      | 0    0       1         | 0        X
beq  | X      0      1      0      | 1    0       0         | 0        X

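The truth table above can be transcribed as a lookup table; a Python sketch (names are ours; None stands in for a don't-care X):

```python
control = {
    "R":   dict(RegDst=1, ALUOp1=1, ALUOp0=0, ALUSrc=0, Branch=0,
                MemRead=0, MemWrite=0, RegWrite=1, MemtoReg=0),
    "lw":  dict(RegDst=0, ALUOp1=0, ALUOp0=0, ALUSrc=1, Branch=0,
                MemRead=1, MemWrite=0, RegWrite=1, MemtoReg=1),
    "sw":  dict(RegDst=None, ALUOp1=0, ALUOp0=0, ALUSrc=1, Branch=0,
                MemRead=0, MemWrite=1, RegWrite=0, MemtoReg=None),
    "beq": dict(RegDst=None, ALUOp1=0, ALUOp0=1, ALUSrc=0, Branch=1,
                MemRead=0, MemWrite=0, RegWrite=0, MemtoReg=None),
}
```

Note that the don't-cares are exactly the signals that only matter when a write happens: RegDst and MemtoReg are irrelevant when RegWrite is 0.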
Graphically Representing MIPS Pipeline

IM → Reg → ALU → DM → Reg

Can help with answering questions like:


How many cycles does it take to execute this code?
What is the ALU doing during cycle 4?
Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!
Time (clock cycles)
[Diagram: five instructions (Inst 0 – Inst 4) overlapped, each flowing through IM, Reg, ALU, DM, Reg, with a new instruction starting every cycle; the first few cycles are the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
structural hazards: attempt to use the same resource by two
different instructions at the same time
data hazards: attempt to use data before it is ready
- An instruction’s source operand(s) are produced by a prior
instruction still in the pipeline
control hazards: attempt to make a decision about program
control flow before the condition has been evaluated and the
new PC target address calculated
- branch and jump instructions, exceptions

Can usually resolve hazards by waiting
- pipeline control must detect the hazard and take action to resolve hazards
Structure Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for that cycle
- Would cause a pipeline “bubble”

Hence, pipelined datapaths require separate


instruction/data memories
Or separate instruction/data caches

A Single Memory Would Be a Structural Hazard
Time (clock cycles)
[Diagram: lw followed by Inst 1–4 in a pipeline with a single memory; in the cycle where lw reads its data from memory, a later instruction is reading its instruction from the same memory – a structural hazard]
Fix with separate instr and data memories (I$ and D$)
How About Register File Access?
Time (clock cycles)
[Diagram: add $1,… followed three instructions later by add $2,$1,…; the write and the read of $1 fall in the same cycle]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
[Annotations mark the clock edge that controls register writing and the clock edge that controls loading of the pipeline state registers]
Data Hazards
Dependencies backward in time cause hazards

[Diagram: add $1,… followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 – the earlier readers need $1 before the add has written it back]

Read before write data hazard


Data Hazards (R inst.)
Dependencies backward in time cause hazards

[Diagram: the same sequence – add $1,… followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5]

Read before write data hazard


Data Hazards (loads)
Dependencies backward in time cause hazards

[Diagram: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 – the loaded value is not available until after the lw's MEM stage]

Load-use data hazard


Control Hazards
Branch determines flow of control
Fetching next instruction depends on branch outcome
Pipeline can’t always fetch correct instruction
- Still working on ID stage of branch

In MIPS pipeline
Need to compare registers and compute target early in the
pipeline
Add hardware to do it in ID stage

Control Hazards
Dependencies backward in time cause control
hazards in branch instructions

[Diagram: beq followed by lw, Inst 3, and Inst 4 – the instructions after the branch are fetched before the branch outcome and target are known]

Other Pipeline Structures Are Possible
What about the (slow) multiply operation?
Make the clock twice as slow or …
let it take two cycles (since it doesn’t use the DM stage)
[Diagram: IM – Reg – ALU/MUL – Reg, with the MUL taking two cycles in place of the unused DM stage]

What if the data memory access is twice as slow as


the instruction memory?
make the clock twice as slow or …
let data memory access take two cycles (and keep the same clock rate)
[Diagram: IM – Reg – ALU – DM1 – DM2 – Reg]

Other Sample Pipeline Alternatives

ARM7: three stages (IM – Reg – EX), covering PC update and IM access; decode and reg access; and ALU op, shift/rotate, DM access, and commit result (write back)

XScale: a deeper pipeline (IM1 – IM2 – Reg – SHFT – ALU – DM1 – DM2/Reg), covering PC update and BTB access with the start of IM access; IM access; decode with register access and shift/rotate; the ALU op and start of DM access; DM access and exception handling; and DM write with reg write
Summary
All modern day processors use pipelining
Pipelining doesn’t help latency of single task, it helps
throughput of entire workload
Potential speedup: a CPI of 1 and a fast CC
Pipeline rate limited by slowest pipeline stage
Unbalanced pipe stages makes for inefficiencies
The time to “fill” pipeline and time to “drain” it can impact
speedup for deep pipelines and short code runs
Must detect and resolve hazards
Stalling negatively affects CPI (makes CPI greater than the ideal of 1)

Review: Data Hazards
Read before write data hazard

Value of $1 10 10 10 10 10/-20 -20 -20 -20 -20

[Diagram: add $1,… followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 – the value of $1 stays 10 until the add's WB in cycle 5, when it becomes -20]

One Way to “Fix” a Data Hazard: Detention

Can fix data hazard by waiting – stall – but impacts CPI
[Diagram: add $1,… followed by two stall cycles, after which sub $4,$1,$5 and and $6,$1,$7 proceed with the written value of $1]

Data Hazards: Detention

An instruction depends on completion of data access by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3

Another Way to “Fix” a Data Hazard: Forwarding
Fix data hazards by forwarding results as soon as they are available to where they are needed
[Diagram: add $1,… followed by sub, and, or, xor; the add's ALU result is forwarded directly to the ALU inputs of sub and and, while or and xor read $1 normally]

Data Hazards: Forwarding (aka Bypassing)

Use result when it is computed


Don’t wait for it to be stored in a register
Requires extra connections in the datapath

Data Hazards: Forwarding (aka Bypassing)
Take the result from the earliest point that it exists in any
of the pipeline state registers and forward it to the
functional units (e.g., the ALU) that need it that cycle
For ALU functional unit: the inputs can come from any
pipeline register rather than just from ID/EX by
adding multiplexors to the inputs of the ALU
connecting the Rd write data in EX/MEM or MEM/WB to either (or both) of the EX stage's Rs and Rt ALU mux inputs
adding the proper control hardware to control the new muxes
Other functional units may need similar forwarding logic
(e.g., the DM)
With forwarding can achieve a CPI of 1 even in the
presence of data dependencies

Forwarding Illustration

[Diagram: add $1,… followed by sub $4,$1,$5 and and $6,$7,$1; EX forwarding supplies $1 to the sub from EX/MEM, and MEM forwarding supplies $1 to the and from MEM/WB]

Data Forwarding Control Conditions
1. EX Forward Unit – forwards the result from the previous instr. to either input of the ALU:
   if (EX/MEM.RegWrite
       and (EX/MEM.RegisterRd != 0)
       and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
     ForwardA = 10
   if (EX/MEM.RegWrite
       and (EX/MEM.RegisterRd != 0)
       and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
     ForwardB = 10

2. MEM Forward Unit – forwards the result from the second previous instr. to either input of the ALU:
   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd != 0)
       and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
     ForwardA = 01
   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd != 0)
       and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
     ForwardB = 01
Yet Another Complication!
Another potential data hazard can occur when there is a
conflict between the result of the WB stage instruction
and the MEM stage instruction – which should be
forwarded?

[Diagram: add $1,$1,$2 followed by add $1,$1,$3 and add $1,$1,$4 – both the EX/MEM and the MEM/WB results target $1, and the most recent (EX/MEM) one must be forwarded]

Corrected Data Forwarding Control Conditions
1. EX Forward Unit – forwards the result from the previous instr. to either input of the ALU:
   if (EX/MEM.RegWrite
       and (EX/MEM.RegisterRd != 0)
       and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
     ForwardA = 10
   if (EX/MEM.RegWrite
       and (EX/MEM.RegisterRd != 0)
       and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
     ForwardB = 10

2. MEM Forward Unit – forwards the result from the previous or second previous instr. to either input of the ALU:
   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd != 0)
       and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
       and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
     ForwardA = 01
   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd != 0)
       and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
       and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
     ForwardB = 01
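The corrected conditions above can be transcribed into executable form; a Python sketch (latches are plain dicts, names are ours; 0b00 selects the register file, 0b10 the EX/MEM latch, 0b01 the MEM/WB latch, per the slide's encodings):

```python
def forward(ex_mem, mem_wb, id_ex):
    """Compute ForwardA and ForwardB from the three pipeline latches."""
    fA = fB = 0b00
    # EX forward unit: result of the previous instruction wins
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0:
        if ex_mem["Rd"] == id_ex["Rs"]:
            fA = 0b10
        if ex_mem["Rd"] == id_ex["Rt"]:
            fB = 0b10
    # MEM forward unit: only when the newer EX/MEM result does not also match
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and ex_mem["Rd"] != id_ex["Rs"]
            and mem_wb["Rd"] == id_ex["Rs"]):
        fA = 0b01
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and ex_mem["Rd"] != id_ex["Rt"]
            and mem_wb["Rd"] == id_ex["Rt"]):
        fB = 0b01
    return fA, fB
```

In the triple-add example, with $1 pending in both latches, the EX/MEM copy is selected for the Rs input, matching the slide's intent.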
Datapath with Forwarding Hardware
[Diagram: the pipelined datapath with a Forward Unit that compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against ID/EX.RegisterRs and ID/EX.RegisterRt, driving the new muxes on the ALU inputs]
Load-Use Data Hazard

Can’t always avoid stalls by forwarding


If value not computed when needed
Can’t forward backward in time!

Forwarding with Load-use Data Hazards

[Diagram: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 – the loaded value exists only after the lw's MEM stage, too late for the sub's EX stage even with forwarding]

Forwarding with Load-use Data Hazards

[Diagram: lw $1,4($2); a one-cycle stall; then sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 – after the stall, forwarding from MEM/WB supplies $1 to the sub]

Will still need one stall cycle even with forwarding


Load-use Hazard Detection Unit
Need a Hazard detection Unit in the ID stage that inserts
a stall between the load and its use
1. ID Hazard detection Unit:
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
stall the pipeline

The first line tests to see if the instruction now in the EX stage is a lw; the next two lines check to see if the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)
After this one cycle stall, the forwarding logic can handle
the remaining data hazards
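The detection condition above translates nearly line-for-line into code. A minimal Python sketch, with illustrative argument names standing in for the pipeline-register fields:

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the lw in EX writes a register that the instruction in ID
    reads, so the pipeline must stall for one cycle."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

# lw $1,4($2) in EX while sub $4,$1,$5 is in ID -> must stall
print(load_use_hazard(True, 1, 1, 5))   # True
# an independent instruction in ID -> no stall needed
print(load_use_hazard(True, 1, 6, 7))   # False
```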
Hazard/Stall Hardware

Along with the Hazard Unit, we have to implement the stall:
- Prevent the instructions in the IF and ID stages from progressing down the pipeline – done by preventing the PC register and the IF/ID pipeline register from changing
  - The Hazard Detection Unit controls the writing of the PC (PC.Write) and IF/ID (IF/ID.Write) registers
- Insert a "bubble" between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage), i.e., insert a noop in the execution stream
  - Set the control bits in the EX, MEM, and WB control fields of the ID/EX pipeline register to 0 (noop). The Hazard Unit controls the mux that chooses between the real control values and the 0's.
- Let the lw instruction and the instructions after it in the pipeline (before it in the code) proceed normally down the pipeline
Adding the Hazard/Stall Hardware

[Datapath diagram: a Hazard Unit added in the ID stage, fed by ID/EX.MemRead and ID/EX.RegisterRt. It controls PC.Write and IF/ID.Write and the mux that zeroes the control signals entering the ID/EX pipeline register.]
Code Scheduling to Avoid Stalls

Reorder code to avoid use of a load result in the next instruction.
C code for A = B + E; C = B + F;

Original (13 cycles):
    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    (stall)
    add  $t3, $t1, $t2
    sw   $t3, 12($t0)
    lw   $t4, 8($t0)
    (stall)
    add  $t5, $t1, $t4
    sw   $t5, 16($t0)

Scheduled (11 cycles):
    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    lw   $t4, 8($t0)
    add  $t3, $t1, $t2
    sw   $t3, 12($t0)
    add  $t5, $t1, $t4
    sw   $t5, 16($t0)
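The 13- and 11-cycle counts can be checked with a small model. Assuming full forwarding, one stall per load immediately followed by a dependent instruction, and total cycles = instruction count + 4 pipeline-fill cycles + stalls, a Python sketch (the (op, dest, sources) encoding is an assumption made for the example):

```python
def cycles(prog):
    """prog: list of (op, dest, sources). One stall per load-use pair."""
    stalls = sum(1 for prev, cur in zip(prog, prog[1:])
                 if prev[0] == 'lw' and prev[1] in cur[2])
    return len(prog) + 4 + stalls  # 4 cycles to fill the 5-stage pipeline

original = [('lw', '$t1', ['$t0']), ('lw', '$t2', ['$t0']),
            ('add', '$t3', ['$t1', '$t2']), ('sw', None, ['$t3', '$t0']),
            ('lw', '$t4', ['$t0']), ('add', '$t5', ['$t1', '$t4']),
            ('sw', None, ['$t5', '$t0'])]
scheduled = [('lw', '$t1', ['$t0']), ('lw', '$t2', ['$t0']),
             ('lw', '$t4', ['$t0']), ('add', '$t3', ['$t1', '$t2']),
             ('sw', None, ['$t3', '$t0']), ('add', '$t5', ['$t1', '$t4']),
             ('sw', None, ['$t5', '$t0'])]
print(cycles(original), cycles(scheduled))  # 13 11
```

Moving the third lw up hides both load-use latencies behind independent instructions, which is exactly what the compiler scheduling on the slide accomplishes.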
Control Hazards

When the flow of instruction addresses is not sequential (i.e., PC = PC + 4); incurred by change-of-flow instructions:
- Unconditional branches (j, jal, jr)
- Conditional branches (beq, bne)
- Exceptions

Possible approaches:
- Stall (impacts CPI)
- Move the decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
- Delay the decision (requires compiler support)
- Predict and hope for the best!

Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards.
Datapath Branch and Jump Hardware

[Datapath diagram: the jump address is formed from PC+4[31-28] concatenated with the instruction's 26-bit field shifted left 2, and a Jump mux selects it as the next PC; the branch target adder feeds the PCSrc mux, selected by the Branch control.]
Jumps Incur One Stall

Jumps are not decoded until ID, so one flush is needed.
To flush, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop).

[Pipeline diagram: fix the jump hazard by waiting – flush. The j instruction is followed by one flushed slot; j target then enters the pipeline.]

Fortunately, jumps are very infrequent – only 3% of the SPECint instruction mix.
Two “Types” of Stalls

- Noop instruction (or bubble) inserted between two instructions in the pipeline (as done for load-use situations)
  - Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle ("bounce" them in place with write control signals)
  - Insert the noop by zeroing the control bits in the pipeline register at the appropriate stage
  - Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
- Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a noop instruction (as done for instructions located sequentially after j instructions)
  - Zero the control bits for the instruction to be flushed
Supporting ID Stage Jumps

[Datapath diagram: the jump address mux and the zeroing of the fetched instruction in IF/ID are driven from the ID stage, so a jump costs a single flushed slot.]
One Way to “Fix” a Control Hazard: Detention

Fix the branch hazard by waiting – flush – but this affects CPI.

[Pipeline diagram: with the branch decision made late in the pipeline, beq is followed by three flushed slots before beq target and Inst 3 can enter the pipeline.]
Another Way to “Fix” a Control Hazard

Move the branch decision hardware back to as early in the pipeline as possible – i.e., during the decode cycle.

[Pipeline diagram: with the branch decision made in the ID stage, beq is followed by only one flushed slot before beq target and Inst 3.]
Reducing the Delay of Branches

- Move the branch decision hardware back to the EX stage
  - Reduces the number of stall (flush) cycles to two
  - Adds an and gate and a 2x1 mux to the EX timing path
- Add hardware to compute the branch target address and evaluate the branch decision in the ID stage
  - Reduces the number of stall (flush) cycles to one (as with jumps)
    - But now need to add forwarding hardware in the ID stage
  - Computing the branch target address can be done in parallel with the RegFile read (done for all instructions – only used when needed)
  - Comparing the registers can't be done until after the RegFile read, so comparing and updating the PC adds a mux, a comparator, and an and gate to the ID timing path
- For deeper pipelines, branch decision points can be even later in the pipeline, incurring more stalls
ID Branch Forwarding Issues

- MEM/WB "forwarding" is taken care of by the normal RegFile write-before-read operation:
      WB   add3 $1,
      MEM  add2 $3,
      EX   add1 $4,
      ID   beq $1,$2,Loop
      IF   next_seq_instr
- Need to forward from the EX/MEM pipeline stage to the ID comparison hardware for cases like:
      WB   add3 $3,
      MEM  add2 $1,
      EX   add1 $4,
      ID   beq $1,$2,Loop
      IF   next_seq_instr

  if (IDcontrol.Branch
      and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd = IF/ID.RegisterRs))
    ForwardC = 1
  if (IDcontrol.Branch
      and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd = IF/ID.RegisterRt))
    ForwardD = 1

  This forwards the result from the second previous instruction to either input of the compare.
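The two forwarding conditions above can be sketched as a small function. Signal names follow the slide's pseudocode; the function itself is an illustration, not the reference design:

```python
def id_branch_forwards(branch, ex_mem_rd, if_id_rs, if_id_rt):
    """ForwardC/ForwardD: steer the EX/MEM result into the ID compare inputs
    when the branch reads the register that instruction is about to write."""
    forward_c = branch and ex_mem_rd != 0 and ex_mem_rd == if_id_rs
    forward_d = branch and ex_mem_rd != 0 and ex_mem_rd == if_id_rt
    return forward_c, forward_d

# add2 writing $1 is in MEM while beq $1,$2,Loop is in ID:
# forward to the Rs input of the compare, not the Rt input
print(id_branch_forwards(True, 1, 1, 2))  # (True, False)
```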
ID Branch Forwarding Issues, con’t

If the instruction immediately before the branch produces one of the branch source operands, then a stall needs to be inserted (between the beq and add1), since the EX stage ALU operation occurs at the same time as the ID stage branch compare operation:
      WB   add3 $3,
      MEM  add2 $4,
      EX   add1 $1,
      ID   beq $1,$2,Loop
      IF   next_seq_instr
- "Bounce" the beq (in ID) and next_seq_instr (in IF) in place (the ID Hazard Unit deasserts PC.Write and IF/ID.Write)
- Insert a stall between the add in the EX stage and the beq in the ID stage by zeroing the control bits going into the ID/EX pipeline register (done by the ID Hazard Unit)

If the branch is found to be taken, then flush the instruction currently in IF (IF.Flush).
Supporting ID Stage Branches

[Datapath diagram: the branch target adder, a register Compare unit, and the Branch control are moved to the ID stage; IF.Flush squashes the instruction in IF on a taken branch, and the Hazard Unit plus two Forwarding Units supply and stall the compare inputs as needed.]
Delayed Branches

If the branch hardware has been moved to the ID stage, then we can eliminate all branch stalls with delayed branches, defined as always executing the next sequential instruction after the branch instruction – the branch takes effect after that next instruction.
- The MIPS compiler moves an instruction to immediately after the branch that is not affected by the branch (a safe instruction), thereby hiding the branch delay

With deeper pipelines, the branch delay grows, requiring more than one delay slot:
- Delayed branches have lost popularity compared to more expensive but more flexible (dynamic) hardware branch prediction
- Growth in available transistors has made hardware branch prediction relatively cheaper
Scheduling Branch Delay Slots

A. From before the branch:
       add $1,$2,$3
       if $2=0 then
         delay slot
   becomes
       if $2=0 then
         add $1,$2,$3

B. From the branch target:
       sub $4,$5,$6
       ...
       add $1,$2,$3
       if $1=0 then
         delay slot
   becomes
       add $1,$2,$3
       if $1=0 then
         sub $4,$5,$6

C. From the fall-through:
       add $1,$2,$3
       if $1=0 then
         delay slot
       sub $4,$5,$6
   becomes
       add $1,$2,$3
       if $1=0 then
         sub $4,$5,$6

- A is the best choice: it fills the delay slot and reduces IC
- In B and C, the sub instruction may need to be copied, increasing IC
- In B and C, it must be okay to execute sub when the branch fails
Branch Prediction

Static branch prediction:
- Based on typical branch behavior
- Example: loop and if-statement branches
  - Predict backward branches taken
  - Predict forward branches not taken

Dynamic branch prediction:
- Hardware measures actual branch behavior
  - e.g., record recent history of each branch
- Assume future behavior will continue the trend
  - When wrong, stall while re-fetching, and update history
Static Branch Prediction

Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome.

1. Predict not taken – always predict branches will not be taken; continue to fetch from the sequential instruction stream; only when a branch is taken does the pipeline stall
- If taken, flush the instructions after the branch (earlier in the pipeline):
  - in the IF, ID, and EX stages if the branch logic is in MEM – three stalls
  - in the IF and ID stages if the branch logic is in EX – two stalls
  - in the IF stage if the branch logic is in ID – one stall
- Ensure that those flushed instructions haven't changed the machine state – automatic in the MIPS pipeline, since machine-state-changing operations are at the tail end of the pipeline (MemWrite in MEM or RegWrite in WB)
- Restart the pipeline at the branch destination
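The CPI impact of predict-not-taken is simple arithmetic: a base CPI of 1 plus the flush penalty paid on each taken branch. A sketch with assumed frequencies (20% branches, 60% of them taken – illustrative numbers, not measurements from the course):

```python
def cpi(branch_freq, taken_frac, penalty):
    """Base CPI of 1 plus stall cycles lost to taken (mispredicted) branches."""
    return 1.0 + branch_freq * taken_frac * penalty

# Penalty depends on the stage where branches resolve (see the list above)
for stage, penalty in [('MEM', 3), ('EX', 2), ('ID', 1)]:
    print(stage, round(cpi(0.20, 0.60, penalty), 2))  # 1.36, 1.24, 1.12
```

This is why moving the branch decision to ID matters: under these assumed frequencies it cuts the branch overhead from 0.36 extra cycles per instruction to 0.12.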
Flushing with Misprediction (Not Taken)

[Pipeline diagram: beq $1,$2,2 at address 4 is predicted not taken; sub $4,$1,$5 at address 8 is fetched and then flushed when the branch resolves taken; execution resumes with and $6,$1,$7 at address 16 and or $8,$1,$9 at address 20.]

To flush the IF stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a noop).
Branching Structures

Predict not taken works well for "top of the loop" branching structures:

    Loop: beq $1,$2,Out
          1st loop instr
          ...
          last loop instr
          j Loop
    Out:  fall out instr

But such loops have jumps at the bottom of the loop to return to the top of the loop – and incur the jump stall overhead.

Predict not taken doesn't work well for "bottom of the loop" branching structures:

    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr
Static Branch Prediction, con’t

Resolve branch hazards by assuming a given outcome and proceeding.

2. Predict taken – predict branches will always be taken
- Predict taken always incurs one stall cycle (if the branch destination hardware has been moved to the ID stage)
- Is there a way to "cache" the address of the branch target instruction?

As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance. With more hardware, it is possible to try to predict branch behavior dynamically during program execution.

3. Dynamic branch prediction – predict branches at run-time using run-time information
Dynamic Branch Prediction

A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains bit(s) passed to the ID stage through the IF/ID pipeline register that tell whether the branch was taken the last time it was executed.
- A prediction bit may predict incorrectly (it may be a wrong prediction for this branch this iteration, or it may come from a different branch with the same low-order PC bits), but that doesn't affect correctness, just performance
  - The branch decision occurs in the ID stage after determining that the fetched instruction is a branch and checking the prediction bit(s)
- If the prediction is wrong, flush the incorrect instruction(s) in the pipeline, restart the pipeline with the right instruction, and invert the prediction bit(s)
  - A 4096-bit BHT varies from 1% misprediction (nasa7, tomcatv) to 18% (eqntott)
Branch Target Buffer

The BHT predicts when a branch is taken, but does not tell where it's taken to!
- A branch target buffer (BTB) in the IF stage caches the branch target address, but we also need to fetch the next sequential instruction. The prediction bit in IF/ID selects which "next" instruction will be loaded into IF/ID at the next clock edge
  - Would need a two-read-port instruction memory
- Or the BTB can cache the branch-taken instruction while the instruction memory is fetching the next sequential instruction

If the prediction is correct, stalls can be avoided no matter which direction branches go.
1-bit Prediction Accuracy

A 1-bit predictor will be incorrect twice when not taken. Assume predict_bit = 0 to start (indicating branch not taken) and loop control at the bottom of the loop code:

    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr

1. First time through the loop, the predictor mispredicts the branch since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1)
2. As long as the branch is taken (looping), the prediction is correct
3. Exiting the loop, the predictor again mispredicts the branch since this time the branch is not taken, falling out of the loop; invert the prediction bit (predict_bit = 0)

For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time.
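The 80% figure can be reproduced with a few lines of simulation. A minimal sketch of a 1-bit predictor run on a loop branch that is taken 9 times and then falls out:

```python
def one_bit_accuracy(outcomes):
    """Simulate a 1-bit predictor; returns the fraction of correct predictions."""
    predict, correct = False, 0          # start predicting not taken
    for taken in outcomes:
        correct += (predict == taken)
        predict = taken                  # a miss flips the prediction bit
    return correct / len(outcomes)

outcomes = [True] * 9 + [False]          # 10 trips: taken 90% of the time
print(one_bit_accuracy(outcomes))        # 0.8 (wrong on first and last trip)
```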
2-bit Predictors

A 2-bit scheme can give 90% accuracy, since a prediction must be wrong twice before the prediction is changed.

[FSM diagram: states 11 and 10 Predict Taken; states 01 and 00 Predict Not Taken. A taken branch moves the counter toward 11, a not-taken branch toward 00; the BHT also stores the initial FSM state.]

    Loop: 1st loop instr
          2nd loop instr
          ...
          last loop instr
          bne $1,$2,Loop
          fall out instr

For the loop above: right 9 times per pass, wrong on the loop fall-out, and right again on the 1st iteration of the next pass.
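The 2-bit saturating counter can be simulated the same way; starting in the strongly-taken state, it is wrong only on the loop fall-out, giving 90% on the same outcome stream:

```python
def two_bit_accuracy(outcomes, state=3):
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        # saturate: taken moves toward 3, not taken toward 0
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

outcomes = [True] * 9 + [False]
print(two_bit_accuracy(outcomes))  # 0.9: right 9 times, wrong only on fall-out
```

After the fall-out the counter drops only to the weakly-taken state, so it still predicts taken on the first iteration of the next pass – the single misprediction per loop exit that the slide describes.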
Summary

All modern-day processors use pipelining for performance (a CPI of 1 and a fast CC).
- Pipeline clock rate is limited by the slowest pipeline stage – so designing a balanced pipeline is important
- Must detect and resolve hazards
  - Structural hazards – resolved by designing the pipeline correctly
  - Data hazards
    - Stall (impacts CPI)
    - Forward (requires hardware support)
  - Control hazards – put the branch decision hardware in as early a stage in the pipeline as possible
    - Stall (impacts CPI)
    - Delay decision (requires compiler support)
    - Static and dynamic prediction (requires hardware support)
- Pipelining complicates exception handling