
COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface


6th Edition

Chapter 4
The Processor

Chapter 4 — The Processor — 2


§4.1 Introduction
Introduction
 We will learn
 How the ISA determines many aspects of the implementation
 How the choice of various implementation strategies affects the
clock rate and CPI for the computer
 We will examine two MIPS implementations
 A simplified version
 A more realistic pipelined version
 A simple subset of the ISA shows most aspects:
 Memory reference: lw, sw
 Arithmetic/logical operation: add, sub, and, or, slt
 Program flow control: beq, j

Chapter 4 — The Processor — 3


Instruction Cycle
 For every instruction, the first three phases are identical:
 Instruction fetch: send the PC to memory and fetch the instruction from memory
 Instruction decode and operand fetch: read one or two registers, using fields of the instruction to select the registers from the register file (RF)
 Execute: use the ALU, depending on the instruction class, to calculate
 an arithmetic result
 the memory address for a load/store
 the branch target address
 Result store: access data memory only for load/store; write the ALU or memory result back into a register, using fields of the instruction to select the register
 Next instruction: PC  target address or PC + 4
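A minimal Python sketch of this fetch-decode-execute cycle (illustrative only: the instruction memory holds pre-decoded tuples rather than real MIPS encodings, and the function and field names are assumptions, not the design built later in this chapter):

# Instructions are tuples such as ("add", rd, rs, rt), ("lw", rt, rs, imm),
# ("sw", rt, rs, imm), ("beq", rs, rt, imm).
def run(imem, dmem, reg, pc, n_steps):
    for _ in range(n_steps):
        op, a1, a2, a3 = imem[pc]            # instruction fetch + decode/operand fetch
        if op == "add":
            reg[a1] = reg[a2] + reg[a3]      # execute, then store the result in a register
        elif op == "lw":
            reg[a1] = dmem[reg[a2] + a3]     # ALU forms the address, data memory is read
        elif op == "sw":
            dmem[reg[a2] + a3] = reg[a1]     # ALU forms the address, data memory is written
        elif op == "beq" and reg[a1] == reg[a2]:
            pc = pc + 4 + (a3 << 2)          # next instruction = branch target
            continue
        pc += 4                              # next instruction = PC + 4
    return pc

regs = {1: 5, 2: 7, 3: 0}
print(run({0: ("add", 3, 1, 2)}, {}, regs, 0, 1), regs[3])   # 4 12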
Datapath vs Controller
 Datapath: storage, functional units (FU), and interconnect sufficient to perform the desired functions
 Its inputs are control points
 Its outputs are (status) signals
 Controller: state machine to orchestrate/control operation of the datapath
 Based on the desired function and the signals
 [Figure: the controller drives the datapath's control points and receives its signals]
Chapter 4 — The Processor — 5
MIPS Datapath (Simplified Ver.) (1/3)

Chapter 4 — The Processor — 6


Add Necessary Multiplexers (2/3)
 Can’t just join
wires together
 Use multiplexers

Chapter 4 — The Processor — 7


Add Control (3/3)

Chapter 4 — The Processor — 8


5-Step to Implement a Processor
1. Analyze the instruction set (datapath requirements)
 The meaning of each instruction is given by the register transfers
 Datapath must include storage elements
 Datapath must support each register transfer
2. Select set of datapath components and establish
clocking methodology
3. Assemble datapath meeting the requirements
4. Analyze the implementation of each instruction to
determine the setting of the control points that effect the
register transfers
5. Assemble the control logic

Chapter 4 — The Processor — 9


Step 1: Analyze the Instruction Set
 All MIPS instructions are 32 bits long, with 3 formats:
 R-type: op (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
 I-type: op (31:26) | rs (25:21) | rt (20:16) | immediate (15:0)
 J-type: op (31:26) | target address (25:0)
 The different fields are:
 op: operation of the instruction
 rs, rt, rd: source and/or destination register
 shamt: shift amount
 funct: selects the variant of the "op" field
 address / immediate: 16-bit address or immediate value
 target address: target address of the jump
Step 1: Analyze the Instruction Set
 Arithmetic/logical operation (R-type: op | rs | rt | rd | shamt | funct):
 add rd, rs, rt
 sub rd, rs, rt
 and rd, rs, rt
 or rd, rs, rt
 slt rd, rs, rt
 Load/Store (I-type: op | rs | rt | imm16):
 lw rt, rs, imm16
 sw rt, rs, imm16
 Immediate operand (I-type):
 addi rt, rs, imm16
 Branch (I-type):
 beq rs, rt, imm16
 Jump (J-type: op | 26-bit address):
 j target

Chapter 4 — The Processor — 11


Logical Register-Transfer Level (RTL)
 RTL is a design abstraction which gives the hardware description of the instructions
MEM[ PC ] = op | rs | rt | rd | shamt | funct      (R-type)
         or = op | rs | rt | Imm16                 (I-type)
         or = op | Imm26                           (J-type)
Inst    Register transfers
ADD     R[rd] <- R[rs] + R[rt];                        PC <- PC + 4
SUB     R[rd] <- R[rs] - R[rt];                        PC <- PC + 4
LOAD    R[rt] <- MEM[ R[rs] + sign_ext(Imm16) ];       PC <- PC + 4
STORE   MEM[ R[rs] + sign_ext(Imm16) ] <- R[rt];       PC <- PC + 4
ADDI    R[rt] <- R[rs] + sign_ext(Imm16);              PC <- PC + 4
BEQ     if (R[rs] == R[rt]) then PC <- PC + 4 + (sign_ext(Imm16) || 00)
        else PC <- PC + 4
J       PC <- PC[31..28] || Imm26 || 00

Chapter 4 — The Processor — 12


§4.2 Logic Design Conventions
Step 2: Datapath Elements
 Information encoded in binary
 Low voltage = 0, High voltage = 1
 One wire per bit; multi-bit data encoded on bus
 Two different types of datapath elements
 Combinational elements
 For computation, the output depends only on the current
inputs
 The output is a function of the input(s)
 State (sequential) elements
 For storing state/information
 The output depends on both the input(s) and the contents of
the internal state

Chapter 4 — The Processor — 13
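A toy Python model of the two element types under the abstractions above (the names and structure are illustrative assumptions, not a hardware description):

def mux(select, a, b):
    return a if select else b              # combinational: output is a function of current inputs

class Register:
    def __init__(self, value=0):
        self.value = value                 # internal state

    def clock_edge(self, d, write_enable=True):
        if write_enable:                   # state updates only on the clock edge, when enabled
            self.value = d
        return self.value

r = Register()
r.clock_edge(42)
print(mux(1, r.value, 0))                  # 42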


Combinational Elements
 Examples of combinational logic elements:
 Adder: Sum = A + B (32-bit operands A and B, CarryIn input, Sum and Carry outputs)
 MUX: Y = S ? A : B (the Select input S chooses between inputs A and B)
 ALU: Result = F(A, B) (the 4-bit ALU control selects the function F)
Sequential Elements (1/2)
 D-type flip-flop: stores data in a circuit
 Uses a clock signal to determine when to update the stored value
 Edge-triggered: updates when Clk changes from 0 to 1
 [Figure: D flip-flop with inputs D and Clk and output Q, with its timing diagram]
Chapter 4 — The Processor — 15


Sequential Elements (2/2)
 Registers (or a register file) and memory with write control
 Only update on the clock edge when the write_enable (Write) control input is 1
 Used when the stored value is required later
 [Figure: register with D, Clk, and Write inputs and output Q, with its timing diagram]
Chapter 4 — The Processor — 16


Clocking Methodology
 A clocking methodology defines when signals can be read and
when they can be written
 Combinational logic transforms data during clock cycles
 Between clock edges (edge-triggered clocking methodology)
 Input from state elements, output to state element
 Longest delay (or critical path) determines clock period

A race problem may otherwise be encountered

Chapter 4 — The Processor — 17


§4.3 Building a Datapath
Step 3: Building a Datapath
 Datapath
 Elements that process data and addresses
in the CPU
 Registers, ALUs, mux’s, memories, …
 We will build a MIPS datapath incrementally
 Refining the overview design

Chapter 4 — The Processor — 18


Instruction Fetch Unit
 Instruction fetch unit is used by other parts of the
datapath
 Fetch the instruction: mem[PC]
 Update the program counter:
 Sequential code: PC <- PC + 4
 Branch and Jump: PC <- “Something else”

 [Figure: 32-bit PC register and instruction memory; an adder increments the PC by 4 for the next instruction]
Step 3a: R-Format Instructions
 Read two register operands
 Perform arithmetic/logical operation
 Write register result

Chapter 4 — The Processor — 20


Add and Subtract
 R[rd] <- R[rs] op R[rt] Ex: add rd, rs, rt
 Ra, Rb, Rw come from inst.’s rs, rt, and rd fields
 ALU and RegWrite: control logic after decode
 R-type format: op (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
 [Figure: register file with two read ports (Ra, Rb) and one write port (Rw); rs and rt select the read registers, rd selects the write register; the 4-bit ALU operation (from funct) and RegWrite are the control signals]

Chapter 4 — The Processor — 21


Load/Store Instructions
 Read register operands
 Calculate address using 16-bit
offset
 Use ALU, but sign-extend
offset
 Load: Read memory and update
register
 Store: Write register value to
memory

Chapter 4 — The Processor — 22


Step 3b: Store/Load Operations
 R[rt]<-Mem[ R[rs]+SignExt[imm16] ] Ex: lw rt,rs,imm16

 I-type format: op (31:26) | rs (25:21) | rt (20:16) | immediate (15:0)
 [Figure: register file and ALU plus data memory; the 16-bit immediate is sign-extended to 32 bits and selected as the second ALU operand; the ALU result is the memory address; control signals: ALU operation, RegWrite, MemRead, MemWrite]
R-Type/Load/Store Datapath

Chapter 4 — The Processor — 24


Recall Branch Instructions
 Read register operands
 Compare operands
 Use ALU, subtract and check Zero output
 Calculate target address
 Sign-extend displacement
 Shift left 2 places (word displacement)
 Add to PC + 4
 Already calculated by instruction fetch

Chapter 4 — The Processor — 25


Branch Operations
 beq rs, rt, imm16
31 26 21 16 0
op rs rt immediate
6 bits 5 bits 5 bits 16 bits

mem[PC] Fetch inst. from memory

COND <- R[rs] == R[rt] Calculate branch condition

if (COND == 0) Calculate next inst. address


PC <- PC + 4 + ( SignExt(imm16) x 4 )
else
PC <- PC + 4

Chapter 4 — The Processor — 26
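A quick check of the address arithmetic above, as a hedged Python sketch (the 16-bit two's-complement sign extension is written out explicitly; the values are made up):

def sign_ext16(imm16):
    # interpret a 16-bit field as a signed value
    return imm16 - (1 << 16) if imm16 & 0x8000 else imm16

def beq_next_pc(pc, imm16, rs_val, rt_val):
    # taken: PC <- PC + 4 + SignExt(imm16) x 4, otherwise PC <- PC + 4
    return pc + 4 + (sign_ext16(imm16) << 2) if rs_val == rt_val else pc + 4

# made-up example: a taken branch at 0x40 with offset 7 targets 0x40 + 4 + 28 = 0x60
print(hex(beq_next_pc(0x40, 7, 5, 5)))   # 0x60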


Step 3c: Branch Instructions
Just
re-routes
wires

Sign-bit wire
replicated

Chapter 4 — The Processor — 27


Structure the Datapath
 First-cut datapath does an instruction in one
clock cycle
 Each datapath element can only do one function at a
time
 Hence, we need separate instruction and data
memories
 Use multiplexers where alternate data sources
are used for different instructions

Chapter 4 — The Processor — 28


A Single Cycle Full Datapath
 [Figure: complete single-cycle datapath; instruction fields [25-21], [20-16], [15-11], [15-0], and [5-0] drive the register file, sign extender, and ALU control; multiplexers are steered by RegDst, ALUSrc, MemtoReg, and PCSrc; other control signals: RegWrite, MemRead, MemWrite, ALUOp]
Chapter 4 — The Processor — 29


Clocking Methodology
 Define when signals are read and written
 Assume edge-triggered (synchronous design):
 Values in storage (state) elements updated only on a clock edge
=> clock edge should arrive only after input signals stable
 Any combinational circuit must have inputs from and outputs to
storage elements
 Clock cycle: time for signals to propagate from one storage
element, through combinational circuit, to reach the second
storage element
 A register can be read, its value propagated through some
combinational circuit, new value is written back to the same
register, all in same cycle => no feedback within a single cycle

Chapter 4 — The Processor — 30


Register-Register Timing
 [Timing diagram for an R-type instruction: after the clock edge, the PC changes after the Clk-to-Q delay; the instruction fields (Rs, Rt, Rd, Op, Func) become valid after the instruction memory access time; ALUctr and RegWr after the delay through the control logic; busA and busB after the register file access time; busW after the ALU delay; the register write occurs at the next rising clock edge]
The Critical Path
 Register file and ideal memory:
 During a read, they behave as combinational logic:
 Address valid => output valid after the access time
 Critical path (load operation) =
 PC's Clk-to-Q +
 Instruction memory's access time +
 Register file's access time +
 ALU delay to perform a 32-bit add +
 Data memory access time +
 Setup time for the register file write +
 Clock skew
 [Figure: load datapath (PC, instruction memory, register file, ALU, data memory) annotated with these delays]
Worst Case Timing (Load)
 [Timing diagram for lw: the PC changes after Clk-to-Q; the instruction fields (Rs, Rt, Rd, Op, Func) after the instruction memory access time; ALUctr, ExtOp, ALUSrc, MemtoReg, and RegWr after the delay through the control logic; busA after the register file access time; busB after the delay through the extender and mux; the address after the ALU delay; busW after the data memory access time; the register write occurs at the end of the cycle]
Step 4: Control Points and Signals
 Control is needed to select the operations to perform and to control the flow of data
 The control unit decodes the Op, Funct, Rs, Rt, Rd, and Imm16 fields of Instruction<31:0>
 It drives the datapath control signals PCSrc, RegDst, ALUSrc, MemWr, MemtoReg, RegWr, MemRd, and ALUctr, and uses the Equal (branch condition) signal from the datapath
 [Figure: instruction memory feeding the control unit, which drives the datapath]
Chapter 4 — The Processor — 34


7 Control Signals

Chapter 4 — The Processor — 35


§4.4 A Simple Implementation Scheme
ALU Control (1)
 ALU used for
 Load/Store: F = add

 Branch: F = subtract

 R-type: F depends on funct field

ALU control Function


0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR

Chapter 4 — The Processor — 36


ALU Control (2)
 Assume 2-bit ALUOp derived from opcode
 Combinational logic derives ALU control

opcode ALUOp Operation funct ALU function ALU control


lw 00 load word XXXXXX add 0010
sw 00 store word XXXXXX add 0010
beq 01 branch equal XXXXXX subtract 0110
R-type 10 add 100000 add 0010
subtract 100010 subtract 0110
AND 100100 AND 0000
OR 100101 OR 0001
set-on-less-than 101010 set-on-less-than 0111

Chapter 4 — The Processor — 37
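A small Python sketch of this two-level decode (the bit patterns come from the tables above; the function name and structure are assumptions for illustration):

FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}

def alu_control(alu_op, funct):
    if alu_op == 0b00:          # lw / sw: address calculation
        return 0b0010           # add
    if alu_op == 0b01:          # beq
        return 0b0110           # subtract
    return FUNCT_TO_ALU[funct]  # R-type: decode the funct field

assert alu_control(0b10, 0b100010) == 0b0110   # R-type subtract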


The Main Control Unit
 Control signals derived from the instruction:
 R-type:     op = 0        (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
 Load/Store: op = 35 or 43 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
 Branch:     op = 4        (31:26) | rs (25:21) | rt (20:16) | address (15:0)
 The opcode is always read; rs and rt are always read; the destination register is rt for load and rd for R-type; the 16-bit address is sign-extended and added for load/store
Chapter 4 — The Processor — 38


Designing Main Control
 Some observations:
 opcode (Op[5-0]) is always in bits 31-26
 two registers to be read are always in rs (bits 25-21) and rt
(bits 20-16) (for R-type, beq, sw)
 base register for lw and sw is always in rs (25-21)
 16-bit offset for beq, lw, sw is always in 15-0
 destination register is in one of two positions:
 lw: in bits 20-16 (rt)
 R-type: in bits 15-11 (rd)
 need a multiplexer to select the destination (written) register
 need a multiplexer to select the second ALU input

Chapter 4 — The Processor — 39


Datapath with Mux and Control
 [Figure: single-cycle datapath with the multiplexers and control points marked; control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemWrite, MemRead, MemtoReg; instruction fields [25-21], [20-16], [15-11], [15-0], [5-0] feed the register file, sign extender, and ALU control]
Chapter 4 — The Processor — 40


Datapath With Control

Chapter 4 — The Processor — 41


For R-Type Instruction

Chapter 4 — The Processor — 42


For I-Type (lw) Instruction

Chapter 4 — The Processor — 43


For I-Type (beq) Instruction

Chapter 4 — The Processor — 44


Implementing J-Type Instructions

Jump 2 address
31:26 25:0

PC <- PC[31..28] || Imm 26 || 00

 Update PC with concatenation of top 4 bits of old PC, 26-bit


jump address, and 002
 Jump looks somewhat like a branch, but always computes
the target PC (i.e. not conditional)
 Jump uses word address
 Need an extra control signal decoded from opcode

Chapter 4 — The Processor — 45


Datapath With Jumps Added

Chapter 4 — The Processor — 46


Concluding Remarks
 Not feasible to vary clock period for different instructions
 Longest delay determines clock period
 Critical path: load instruction
 Instruction memory  register file  ALU  data memory 
register file

 “Making the common case fast” cannot improve the


worst-case delay  Single cycle implementation violates
the design principle
 We will improve performance by pipelining

Chapter 4 — The Processor — 47


Pipelining Implementation
 Critical path reduction
 [Figure: a long block of acyclic combinational logic between two storage elements is split into two shorter blocks (A) and (B), with an extra storage element inserted between them]
§4.6 An Overview of Pipelining
Pipelining Analogy
 Pipelined laundry: overlapping execution
 Parallelism improves performance
 Four loads:
 Speedup = 8/3.5 = 2.3
 Non-stop (n loads, n large):
 Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages

Chapter 4 — The Processor — 49
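The two speedup figures above can be recomputed with a short Python helper (the 4 stages and 0.5-hour stage time are the slide's laundry assumptions):

def pipeline_speedup(n_tasks, n_stages=4, stage_time=0.5):
    unpipelined = n_tasks * n_stages * stage_time                    # 4 loads: 8 hours
    pipelined = n_stages * stage_time + (n_tasks - 1) * stage_time   # 4 loads: 3.5 hours
    return unpipelined / pipelined

print(round(pipeline_speedup(4), 2))       # 2.29, the "8/3.5 = 2.3" case
print(round(pipeline_speedup(10**6), 2))   # approaches 4 = the number of stages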


Steps for Designing a Pipelined Processor

1. Examine the datapath and control diagram

 We will start with the single cycle datapath

2. (Well-balanced) Partition datapath into stages

3. Associate resources with stages

4. Ensure no conflict, or figure out how to resolve

5. Assert control in appropriate stage

Chapter 4 — The Processor — 50


5-Stage MIPS Pipeline
 Five steps, one stage per step
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

Chapter 4 — The Processor — 51


Partition Single-Cycle Datapath
 Add registers between steps
 Stages: instruction fetch, RF access (decode), ALU operation, memory access, write back
 [Figure: single-cycle datapath with the five stage boundaries marked; control signals PCSrc, RegWrite, ALUSrc, ALU operation, MemWrite, MemRead, MemtoReg]
Pipeline Performance
 Assume time for stages is
 100ps for register read or write
 200ps for other stages
 Compare pipelined datapath with single-cycle
datapath

Instr      Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw         200ps        100ps          200ps   200ps          100ps           800ps
sw         200ps        100ps          200ps   200ps                          700ps
R-format   200ps        100ps          200ps                  100ps           600ps
beq        200ps        100ps          200ps                                  500ps

Chapter 4 — The Processor — 53
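The totals in this table follow directly from the per-stage delays; a tiny Python check (the stage names and times are the slide's assumptions):

STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}
USES = {
    "lw":       ["IF", "ID", "EX", "MEM", "WB"],
    "sw":       ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq":      ["IF", "ID", "EX"],
}
for instr, stages in USES.items():
    print(instr, sum(STAGE_PS[s] for s in stages), "ps")
# The single-cycle clock must cover the slowest instruction (lw: 800 ps);
# the pipelined clock only has to cover the slowest stage (200 ps).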


Pipeline Performance
Single-cycle (Tc= 800ps)

Pipelined (Tc= 200ps)

Chapter 4 — The Processor — 54


Pipeline Speedup
 If all stages are balanced
 i.e., all take the same time
 Time between instructions (pipelined)
= Time between instructions (nonpipelined) / Number of stages
 If not balanced, speedup is less
 Speedup due to increased throughput
 Latency (time for each instruction) does not decrease

Chapter 4 — The Processor — 55


Pipelining Lessons
 Doesn’t help the latency of a single task, but improves the throughput of the
entire workload
 Pipeline rate limited by slowest stage
 Multiple tasks working at same time using different
resources
 Potential speedup = Number pipe stages
 Unbalanced stage length; time to “fill” & “drain” the
pipeline reduce speedup
 Stall for dependences or pipeline hazards

— 56
MIPS ISA Designed for Pipelining
(Pipelining a RISC is easier than pipelining a CISC)
 All instructions are 32 bits
 Easier to fetch and decode in one cycle
 c.f. x86: 1- to 17-byte instructions
 Few and regular instruction formats
 Can decode and read registers in one step
 Load/store addressing
 Can calculate address in 3rd stage, access memory in
4th stage
 Alignment of memory operands
 Memory access takes only one cycle

Chapter 4 — The Processor — 57


Pipelined Datapath
 Use registers between stages to carry data and control
 Pipeline registers (latches): IF/ID, ID/EX, EX/MEM, MEM/WB
 [Figure: single-cycle datapath split by the four pipeline registers; PC, instruction memory, register file, sign extender, ALU, data memory, and write-back mux]
Lecture06 - pipelining ([email protected]) — 58


Consider Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Load Ifetch Reg/Dec Exec Mem Wr

 IF: Instruction Fetch


 Fetch the instruction from the Instruction Memory
 ID: Instruction Decode
 Registers fetch and instruction decode
 EX: Calculate the memory address
 MEM: Read the data from the Data Memory
 WB: Write the data back to the register file

— 59
Pipelining lw Instructions
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Clock

1st lw Ifetch Reg/Dec Exec Mem Wr

2nd lw Ifetch Reg/Dec Exec Mem Wr

3rd lw Ifetch Reg/Dec Exec Mem Wr

 5 functional units in the pipeline datapath are:


 Instruction Memory for the Ifetch stage
 Register File’s Read ports (busA and busB) for the Reg/Dec
stage
 ALU for the Exec stage
 Data Memory for the MEM stage
 Register File’s Write port (busW) for the WB stage
— 60
The Four Stages of R-type Instruction

Cycle 1 Cycle 2 Cycle 3 Cycle 4

R-type Ifetch Reg/Dec Exec Wr

 IF: fetch the instruction from the Instruction Memory


 ID: registers fetch and instruction decode
 EX: ALU operates on the two register operands
 WB: write ALU output back to the register file

Lecture06 - pipelining ([email protected]) — 61


Hazard Problem
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock

R-type Ifetch Reg/Dec Exec Wr    Oops! We have a problem!


R-type Ifetch Reg/Dec Exec Wr

Load Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Wr

R-type Ifetch Reg/Dec Exec Wr

 We have a structural hazard:


 Two instructions try to write to the RF at the same time, but only
one write port !

Lecture06 - pipelining ([email protected]) — 62


Pipeline Hazards
 Situations that prevent starting the next instruction in the
next cycle
 Structural hazard
 A required resource is busy
 Data hazard
 Need to wait for previous instruction to complete its data
read/write
 Control hazard
 Deciding on control action depends on previous instruction
 Several ways to solve: forwarding, adding pipeline bubble,
making instructions same length
Chapter 4 — The Processor — 63
Structural Hazards
 Conflict for use of a resource

 In MIPS pipeline with a single memory

 Load/store requires data access

 Instruction fetch would have to stall for that cycle

 Would cause a pipeline “bubble”

 Two-port single memory

 Or a separate instruction memory and data memory


(or separate instruction/data caches)

Chapter 4 — The Processor — 64


Structural Hazard Solution
 [Pipeline diagram: Load and Instr 1-4 each flow through Mem, Reg, ALU, Mem, Reg in successive, overlapping cycles]
 1. Separate I/D memories: a data memory and an instruction memory
 2. Register file: first half cycle for write and second half cycle for read
Lecture06 - pipelining ([email protected]) — 65


Structural Hazard Solution:
Delay R-type’s Write
 Delay R-type’s register write by one cycle:
 R-type also use Reg File’s write port at Stage 5

 MEM is a NOP stage: nothing is being done.

         1      2       3    4   5        (R-type also has 5 stages)
R-type  Ifetch Reg/Dec Exec Mem Wr

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9


Clock

R-type Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Mem Wr

Load Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Mem Wr


The Four Stages of sw

Cycle 1 Cycle 2 Cycle 3 Cycle 4

Store Ifetch Reg/Dec Exec Mem Wr

 IF: fetch the instruction from the Instruction Memory


 ID: registers fetch and instruction decode

 EX: calculate the memory address

 MEM: write the data into the Data Memory

Add an extra stage:


 WB: NOP

Lecture06 - pipelining ([email protected]) — 67


The Three Stages of beq
Cycle 1 Cycle 2 Cycle 3 Cycle 4

Beq Ifetch Reg/Dec Exec Mem Wr

 IF: fetch the instruction from the Instruction Memory


 ID: registers fetch and instruction decode

 EX:

 compares the two register operand

 select correct branch target address

 latch into PC

Add two extra stages:


 MEM: NOP

 WB: NOP

Lecture06 - pipelining ([email protected]) — 68


Data Hazards
 An instruction depends on completion of data
access by a previous instruction
 add $s0, $t0, $t1
sub $t2, $s0, $t3

Chapter 4 — The Processor — 69


Types of Data Hazards
Three types: (inst. i1 followed by inst. i2)
 RAW (read after write): True data dependency
i2 tries to read operand before i1 writes it
 WAR (write after read): Name dependency
i2 tries to write operand before i1 reads it
 Gets wrong operand, e.g., autoincrement addr.
 Can’t happen in MIPS 5-stage pipeline because:
 All instructions take 5 stages, and reads are always in stage 2, and writes are always in
stage 5

 WAW (write after write): Name dependency


i2 tries to write operand before i1 writes it
 Leaves wrong result ( i1’s not i2’s); occur only in pipelines that write in more than
one stage
 Can’t happen in MIPS 5-stage pipeline because:
 All instructions take 5 stages, and writes are always in stage 5
 RAR?
No dependency

Chapter 4 — The Processor — 70


Handling Data Hazards
 Use simple, fixed designs
 Eliminate WAR by always fetching operands early (ID) in pipeline
 Eliminate WAW by doing all write backs in order (last stage,
static)
 These features have a lot to do with ISA design
 Internal forwarding in register file:
 Write in first half of clock and read in second half
 Read delivers what is written, resolve hazard between sub and
add
 Detect and resolve remaining ones
 Compiler inserts NOP, or reorders the code sequence
 Forward
 Stall

Chapter 4 — The Processor — 71


Forwarding (aka Bypassing)
 Use result when it is computed
 Don’t wait for it to be stored in a register
 Requires extra connections in the datapath
=> increases hardware complexity

Chapter 4 — The Processor — 72


Example
 Consider the following code sequence

sub $2, $1, $3


and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Chapter 4 — The Processor — 73


Data Hazards Solution:
Inserting NOPs by Software
 Value of register $2 (CC 1 to CC 9): 10 10 10 10 10/–20 –20 –20 –20 –20
 Insert two nops between sub and the first use of $2:
 sub $2, $1, $3
 (nop)
 (nop)
 and $12, $2, $5
 or $13, $6, $2
 add $14, $2, $2
 sw $15, 100($2)
 [Pipeline diagram: with two nops inserted, sub writes $2 back before and reads it]
Lecture06 - pipelining ([email protected]) — 74


Data Hazards Solution:
Internal Forwarding Logic
 Use temporary results, e.g., those in the pipeline registers; don't wait for them to be written back
 Value of register $2 (CC 1 to CC 9): 10 10 10 10 10/–20 –20 –20 –20 –20
 Value of EX/MEM:  X X X –20 X X X X X
 Value of MEM/WB:  X X X X –20 X X X X
 Program execution order: sub $2,$1,$3; and $12,$2,$5; or $13,$6,$2; add $14,$2,$2; sw $15,100($2)
 [Pipeline diagram: the sub result is forwarded from the EX/MEM and MEM/WB registers to the dependent instructions]


HW Change for Forwarding
 Additional hardware is required: muxes at the ALU inputs select among the register file outputs (ID/EX), the EX/MEM result, and the MEM/WB result
 [Figure: datapath with forwarding muxes in front of the ALU, fed from the EX/MEM and MEM/WB pipeline registers]
Chapter 4 — The Processor — 76


Load-Use Data Hazard
 Can’t always avoid stalls by forwarding
 If value not computed when needed
 Can’t forward backward in time!

How to insert a bubble ???


Software Check or Hardware Handling

Chapter 4 — The Processor — 77


Rescheduling Code to Avoid Stalls
 Compiler reorders the code sequence to avoid use of
load result in the next instruction
 C code for A = B + E; C = B + F;

        Original (13 cycles):       Rescheduled (11 cycles):
        lw   $t1, 0($t0)            lw   $t1, 0($t0)
        lw   $t2, 4($t0)            lw   $t2, 4($t0)
stall   add  $t3, $t1, $t2          lw   $t4, 8($t0)
        sw   $t3, 12($t0)           add  $t3, $t1, $t2
        lw   $t4, 8($t0)            sw   $t3, 12($t0)
stall   add  $t5, $t1, $t4          add  $t5, $t1, $t4
        sw   $t5, 16($t0)           sw   $t5, 16($t0)

Chapter 4 — The Processor — 78
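The cycle counts quoted above can be reproduced with a simple model: in a 5-stage pipeline with no stalls, n instructions take n + 4 cycles (4 cycles to fill the pipeline), and each load-use stall adds one cycle. A hedged Python check:

def total_cycles(n_instructions, n_stalls, n_stages=5):
    return n_instructions + (n_stages - 1) + n_stalls

print(total_cycles(7, 2))   # original order: 13 cycles (two load-use stalls)
print(total_cycles(7, 0))   # rescheduled:    11 cycles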


Control Hazards
 Branch determines flow of control

 Fetching next instruction depends on branch outcome

 Pipeline might not fetch correct instruction

 Still working on ID stage of branch

 In MIPS pipeline

 Need to compare registers and compute target


address in the pipeline

Chapter 4 — The Processor — 79


Control Hazard on Branches
 10: beq r1,r3,36     Ifetch Reg ALU DMem Reg
 14: and r2,r3,r5            Ifetch Reg ALU DMem Reg
 18: or  r6,r1,r7                   Ifetch Reg ALU DMem Reg
 22: add r8,r1,r9                          Ifetch Reg ALU DMem Reg
 36: xor r10,r1,r11                               Ifetch Reg ALU DMem Reg
 What do you do with the 3 instructions in between?
 The simplest solution is to stall the pipeline as soon as a branch instruction is detected

Chapter 4 — The Processor — 80


Branch Stall Impact
 If CPI = 1 and 30% conditional branch,
 Stall 3 cycles => new CPI = 1.9 !
 Two-part solution:
 Determine branch taken or not sooner, AND
 Compute taken branch address earlier
 MIPS branch tests if register = 0 or ≠ 0
 MIPS Solution:
 Move Zero test to ID/RF stage
 Adder to calculate new PC in ID/RF stage
 Add hardware in ID stage for conditional branch
 1 clock cycle penalty vs. 3

Chapter 4 — The Processor — 81
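The 1.9 above and the improved figure follow from a simple effective-CPI calculation (the base CPI of 1 and the 30% branch frequency are the slide's assumptions):

def effective_cpi(base_cpi=1.0, branch_frac=0.30, penalty_cycles=3):
    return base_cpi + branch_frac * penalty_cycles

print(effective_cpi(penalty_cycles=3))   # 1.9  (branch resolved late, 3-cycle stall)
print(effective_cpi(penalty_cycles=1))   # 1.3  (branch moved to the ID stage)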


Stall on Branch
 Wait until branch outcome determined before fetching
next instruction

How to add a stall cycle ?

Chapter 4 — The Processor — 82


Four Alternatives for Control Hazard (1/2)
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken


– But haven’t calculated branch target address in MIPS, it still
incurs 1 cycle branch penalty
– Advantage of branch target is known before outcome

Chapter 4 — The Processor — 83


MIPS with Predict Not Taken
 [Pipeline diagrams: when the prediction is correct, the sequential instructions proceed normally; when the prediction is incorrect, the wrongly fetched instructions are turned into bubbles]

Chapter 4 — The Processor — 84


Four Alternatives for Control Hazard (2/2)
#4: Delayed Branch – make the stall cycle useful
– Define branch to take place AFTER a following instruction

branch instruction
sequential successor1
sequential successor2
........ Branch delay of length n
sequential successorn These insts. are executed !!
branch target if taken

– 0-cycle latency, if all the stall cycles are useful

Chapter 4 — The Processor — 85


More-Realistic Branch Prediction
 Static n-bit branch prediction (discussed later)
 Based on typical branch behavior
 Example: loop and if-statement branches
 Predict backward branches taken
 Predict forward branches not taken

 Dynamic branch prediction


 Hardware/Software measures actual branch behavior
 e.g., record recent history of each branch

 Assume future behavior will continue the trend


 When wrong, stall while re-fetching, and update history

Chapter 4 — The Processor — 86


Pipeline Summary
The BIG Picture

 Pipelining improves performance by increasing the


number of simultaneously executing instructions
 Executes multiple instructions in parallel
 Each instruction has the same latency
 Subject to hazards
 Structure, data, and control
 Instruction set design affects complexity of pipeline
implementation
 RISC vs. CISC

Chapter 4 — The Processor — 87


§4.7 Pipelined Datapath and Control
Recall: Steps for Designing a Pipelined
Processor
 Examine the datapath and control diagram
 Starting with single cycle datapath
 Partition datapath into stages:
 IF (instruction fetch), ID (instruction decode and register file read),
EX (execution or address calculation), MEM (data memory
access), WB (write back)
 Associate resources with stages
 Ensure that flows do not conflict, or figure out how to
resolve
 Assert control in appropriate stage

Chapter 4 — The Processor — 88


MIPS Single-Cycle Datapath
 [Figure: single-cycle datapath with its two right-to-left flows highlighted — the MEM-to-WB write-back path and the branch-target path into the PC]
 Right-to-left flows lead to hazards
Chapter 4 — The Processor — 89


Pipeline Registers
 Use registers between stages to carry data and control
 Pipeline registers (latches): IF/ID, ID/EX, EX/MEM, MEM/WB
 [Figure: pipelined datapath with the four pipeline registers inserted between the stages]
Chapter 4 — The Processor — 90


MIPS ISA Micro-Operations
One way to show what happens in pipelined execution

Instruction fetch (all instructions):
  IR = Memory[PC]; PC = PC + 4
Instruction decode & register fetch (all instructions):
  A = Reg[IR[25-21]]; B = Reg[IR[20-16]];
  ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Execution / address computation:
  R-type: ALUOut = A op B
  Memory reference: ALUOut = A + sign-extend(IR[15-0])
  Branch: if (A == B) then PC = ALUOut
  Jump: PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion:
  Load: MDR = Memory[ALUOut]
  Store: Memory[ALUOut] = B
Memory read completion / R-type completion:
  Load: Reg[IR[20-16]] = MDR
  R-type: Reg[IR[15-11]] = ALUOut
Chapter 4 — The Processor — 91


Pipeline Micro-Operation
 Cycle-by-cycle flow of instructions through the
pipelined datapath
 Shows pipeline usage in a single cycle (stage)

 Highlight resources used

 We’ll look at “single-clock-cycle” diagrams for


load instruction

Chapter 4 — The Processor — 92


IF Stage of lw
 Ex: lw rt,rs,imm16
 Instruction fetch: IR = mem[PC]; latch IR and PC + 4 into IF/ID
 [Figure: pipelined datapath with the IF stage (PC, instruction memory, PC+4 adder, IF/ID register) highlighted]
Chapter 4 — The Processor — 93


ID Stage of lw
 Ex: lw rt,rs,imm16
 Instruction decode: A = Reg[IR[25-21]]; B = Sign-ext(IR[15-0])
 [Figure: pipelined datapath with the ID stage (register file read and sign extension, ID/EX register) highlighted]
Chapter 4 — The Processor — 94


EX Stage of lw
 Ex: lw rt,rs,imm16
 Execution: ALUout = A + B
 [Figure: pipelined datapath with the EX stage (ALU address calculation, EX/MEM register) highlighted]
Chapter 4 — The Processor — 95


MEM Stage of lw
 Ex: lw rt,rs,imm16
 Memory: MDR = mem[ALUout]
 [Figure: pipelined datapath with the MEM stage (data memory read, MEM/WB register) highlighted]
Chapter 4 — The Processor — 96


WB Stage of lw
 Ex: lw rt,rs,imm16
 Write back: Reg[IR[20-16]] = MDR
 Problem: by this cycle the IF/ID register holds a later instruction, so its rt field is the wrong IR!  Who will supply the write-register address?
 [Figure: pipelined datapath with the WB stage highlighted; the write-register number taken from IF/ID belongs to a different instruction]
Corrected Datapath for lw
 Reg[IR[20-16]] = MDR
 [Figure: corrected pipelined datapath; the destination register number is carried forward through the ID/EX, EX/MEM, and MEM/WB pipeline registers and fed back to the register file's write-register port]
Graphically Representing Pipelines
 Two representations

 Multiple-clock-cycle pipeline diagram

 Single-clock-cycle pipeline diagram

 Example
lw $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw $13, 24($1)
add $14, $5, $6

Chapter 4 — The Processor — 99


Multiple-Clock-Cycle Pipeline Diagram (1/2)
 Traditional form
 Can show the resource usage at each CC

Chapter 4 — The Processor — 100


Multiple-Clock-Cycle Pipeline Diagram (2/2)

Can show the resource usage at each CC

Chapter 4 — The Processor — 101


Single-Clock-Cycle Pipeline Diagram
 The clock cycle 5 of the pipeline

Chapter 4 — The Processor — 102


Pipelined Control (1/2)
 Why control? A single datapath used by several types of instruction
 Start with the same ALU control logic, branch logic, destination-register-number
MUX, and control lines used by the simplified single-cycle datapath
Pipelined Control (2/2)
 To specify control for the pipeline, we need to set the control
values during each pipeline stage.
 The simplest implementation is data-stationary pipelined control:
 extend the pipeline registers to include the control
information

Chapter 4 — The Processor — 104


Data Stationary Pipelined Control
 Control signals derived from instruction
 Main control generates control signals during ID
 Pass control signals along just like the data
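A minimal Python sketch of this data-stationary idea (the opcode, signal names, and dictionary layout are illustrative assumptions): the main control produces all three groups of signals in ID, and each later pipeline register drops the group that has just been consumed.

def main_control(opcode):
    # only the lw case is filled in; signal names follow the slides
    if opcode == "lw":
        return {"EX":  {"ALUSrc": 1, "ALUOp": 0b00, "RegDst": 0},
                "MEM": {"MemRead": 1, "MemWrite": 0, "Branch": 0},
                "WB":  {"RegWrite": 1, "MemtoReg": 1}}
    raise NotImplementedError

id_ex  = {"ctrl": main_control("lw")}                              # generated in ID
ex_mem = {"ctrl": {k: id_ex["ctrl"][k] for k in ("MEM", "WB")}}    # EX drops its own group
mem_wb = {"ctrl": {"WB": ex_mem["ctrl"]["WB"]}}                    # MEM drops its group
print(mem_wb["ctrl"]["WB"])    # {'RegWrite': 1, 'MemtoReg': 1}, used in WB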
Control Lines for the Final Three Stages
 Signals for EX (ExtOp, ALUSrc, ALUOp, RegDst, ...) are used 1 cycle later
 Signals for MEM (MemWr, Branch) are used 2 cycles later
 Signals for WB (MemtoReg, RegWr) are used 3 cycles later
 [Figure: the main control generates all signals during ID; the EX-stage signals are consumed out of the ID/EX register, the MEM-stage signals out of the EX/MEM register, and the WB-stage signals out of the MEM/WB register]
Lecture06 - pipelining ([email protected]) — 106


Pipelined Datapath with Control Signals
§4.8 Data Hazards: Forwarding vs. Stalling
Data Hazards: Forwarding vs. Stalling
 Consider the following code sequence:
sub $2, $1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)

 We can resolve hazards with forwarding…

Chapter 4 — The Processor — 108


Dependencies & Forwarding

How do the hardware detect when to do forward?

Chapter 4 — The Processor — 109


Forwarding Unit for Hazard Detection (1/2)
 We need to detect RAW (WAR, WAW) hazard
 Compare R/W register number between instructions

 Pass register numbers along pipeline

 ID/EX.RegisterRs  register number for Rs sitting in

ID/EX pipeline register


 ALU operand register numbers in EX stage are given by

ID/EX.RegisterRs and ID/EX.RegisterRt


 Data hazard conditions are:
 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (forward from the EX/MEM pipeline reg)
 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (forward from the EX/MEM pipeline reg)
 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (forward from the MEM/WB pipeline reg)
 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (forward from the MEM/WB pipeline reg)

Chapter 4 — The Processor — 110


Forwarding Unit for Hazard Detection (2/2)
 Not always do forward when data hazard conditions are true
 Because some instructions do not write registers

 Do forward only if (1) data hazard conditions are true; and (2)
the forwarding instruction will write to a register and the
destination register is not $zero
 Check if EX/MEM.RegWrite is active for 1a/1b and if
MEM/WB.RegWrite is active for 2a/2b

 Check if EX/MEM.RegisterRd ≠ 0 for 1a/1b and check if


MEM/WB.RegisterRd ≠ 0 for 2a/2b

Chapter 4 — The Processor — 111


Forwarding Paths

Chapter 4 — The Processor — 112


Control Values for the Forwarding Muxes

Mux control     Source   Explanation
ForwardA = 00   ID/EX    The first ALU operand comes from the register file.
ForwardA = 10   EX/MEM   The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01   MEM/WB   The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00   ID/EX    The second ALU operand comes from the register file.
ForwardB = 10   EX/MEM   The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01   MEM/WB   The second ALU operand is forwarded from data memory or an earlier ALU result.

Chapter 4 — The Processor — 113


Forwarding Conditions
 EX hazard
 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
 MEM hazard
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

Chapter 4 — The Processor — 114


Double Data Hazard
 Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
 Both hazards occur  Want to use the most recent
 Revise MEM hazard condition  Only forward if EX
hazard condition isn’t true

Chapter 4 — The Processor — 115


Revised Forwarding Conditions
 MEM hazard
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01

 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)


and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
(Note: p. 323 of the textbook has an error here!)

Chapter 4 — The Processor — 116
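A behavioural Python sketch of the ForwardA logic with these revised conditions (field names follow the slides' pipeline-register notation; plain dictionaries stand in for the pipeline registers, an assumption for illustration). Giving the EX-hazard check priority, as below, is equivalent to the "and not (EX hazard)" term; ForwardB is symmetric, using RegisterRt.

def forward_a(id_ex, ex_mem, mem_wb):
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex["Rs"]:
        return 0b10                     # forward from the EX/MEM pipeline register
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex["Rs"]:
        return 0b01                     # forward from the MEM/WB pipeline register
    return 0b00                         # operand comes from the register file

# add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4 -> use the most recent $1 (EX/MEM)
assert forward_a({"Rs": 1}, {"RegWrite": True, "Rd": 1},
                 {"RegWrite": True, "Rd": 1}) == 0b10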


Datapath with Forwarding

Chapter 4 — The Processor — 117


Load-Use Data Hazard

Even forwarding cannot solve load-use data hazard

Need to stall
for one cycle

Chapter 4 — The Processor — 118


Load-Use Hazard Conditions
 Check when the use-instruction is decoded in ID stage
 ALU operand register numbers in ID stage are given by
IF/ID.RegisterRs, IF/ID.RegisterRt
 Load-use hazard when
 ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
 If detected, stall and insert bubble

Chapter 4 — The Processor — 119
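A minimal Python version of this check (again using dictionaries as stand-in pipeline registers; field names follow the slide):

def load_use_hazard(id_ex, if_id):
    # if true: force the ID/EX control values to 0 and hold the PC and IF/ID register
    return id_ex["MemRead"] and (
        id_ex["Rt"] == if_id["Rs"] or id_ex["Rt"] == if_id["Rt"])

# lw $2, 20($1) immediately followed by and $4, $2, $5 -> stall one cycle
assert load_use_hazard({"MemRead": True, "Rt": 2}, {"Rs": 2, "Rt": 5})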


Stall/Bubble in the Pipeline (1/2)

Stall inserted
here

Chapter 4 — The Processor — 120


How to Stall the Pipeline?
 Force control values in ID/EX register to 0
 EX, MEM and WB do nop (no-operation)
 Prevent update of PC and IF/ID register
 Using instruction is decoded again
 Following instruction is fetched again
 1-cycle stall allows MEM to read data for lw
 Can subsequently forward to EX stage

Chapter 4 — The Processor — 121


Stall/Bubble in the Pipeline (2/2)

Or, more
accurately…
Pipelined Control with Forwarding Unit and
Hazard Detection Unit

Chapter 4 — The Processor — 123


Stalls and Performance
The BIG Picture

 Stalls reduce pipeline performance


 But are required to get correct results
 Compiler can arrange code to avoid hazards and
stalls
 Requires knowledge of the pipelined datapath

Chapter 4 — The Processor — 124


§4.9 Control Hazards
Branch Hazards
 If the branch outcome is determined in MEM, the three instructions already fetched must be flushed (their control values set to 0)
 [Figure: original pipelined datapath; the branch resolves in MEM and the new PC arrives three cycles after the fetch]

Chapter 4 — The Processor — 125


Reducing Branch Delay
 Move the hardware that determines the branch outcome to the ID stage
 Additional hardware: a target-address adder and a register comparator
Pipelined Branch Example
 Assumed the pipeline is optimized for branch-not-taken and that we
moved the branch execution to the ID stage
 Show what happens when the branch is taken
36: sub $10, $4, $8
40: beq $1, $3, 7 #PC-relative addressing
#target = 40+4+7*4 = 72
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw $4, 50($7)

Chapter 4 — The Processor — 127


Clock Cycle 3 after …
 [Figure: the beq resolves as taken (condition True); the branch target 72 is selected for the PC and the sequentially fetched instruction (at PC 48 minus 4) will be flushed]
Clock Cycle 4 after …
One pipeline bubble on a taken branch
Data Hazards for Branch -- I
 If the comparison registers are a destination of 1st and
2nd preceding ALU instruction

add $1, $2, $3 IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

beq $1, $4, target IF ID EX MEM WB

IF ID EX MEM WB

 Can resolve using forwarding, but need 1 stall cycle

Chapter 4 — The Processor — 130


Data Hazards for Branch -- I
 Solution

add $1, $2, $3 IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

beq stalled IF ID

beq $1, $4, target ID EX MEM WB

Chapter 4 — The Processor — 131


Data Hazards for Branch -- II
 If a comparison register is a destination of immediately
preceding load instruction
 Can resolve using forwarding, but need 2 stall cycles

lw $1, addr IF ID EX MEM WB

beq stalled IF ID

beq stalled ID

beq $1, $0, target ID EX MEM WB

Chapter 4 — The Processor — 132


Delayed Branch
 Predict-not-taken + branch decision at ID
=> the following instruction is always executed
=> branches take effect 1 cycle later
 [Pipeline diagram: add, beq, misc (the delay-slot instruction), lw; the instruction after beq always executes, so the branch takes effect one cycle later]
 0 clock cycle penalty per branch instruction if we can find an instruction to put in the slot

Lecture06 - pipelining ([email protected]) — 133


Scheduling the Branch Delay Slot

1. A is the best choice, fills delay slot & reduces instruction count (IC)
2. In B, the sub instruction may need to be copied, increasing IC
3. In B and C, must be okay to execute sub when branch fails
Delay-Branch Scheduling Schemes

Scheduling strategy   Requirements                                           Improves performance when?
From before           Branch must not depend on the rescheduled              Always
                      instructions
From target           Must be OK to execute rescheduled instructions if      When branch is taken; may enlarge the
                      branch is not taken; may need to duplicate             program if instructions are duplicated
                      instructions
From fall-through     Must be OK to execute instructions if branch is        When branch is not taken
                      taken
Chapter 4 — The Processor — 135


Dynamic Branch Prediction
 In deeper and superscalar pipelines, branch
penalty is more significant
 Use dynamic branch prediction
 Branch prediction buffer (aka branch history table)
 Indexed by recent branch instruction addresses
 Stores outcome (taken/not taken)
 To execute a branch
 Check table, expect the same outcome
 Start fetching from fall-through or target
 If wrong, flush pipeline and flip prediction  1-bit predictor

Chapter 4 — The Processor — 136


Shortcoming for 1-Bit Predictor
 Inner loop branches mispredicted twice !!

outer: …
       …
inner: …
       …
       beq …, …, inner
       …
       beq …, …, outer

 Mispredict as taken on last iteration of inner loop


 Then mispredict as not taken on first iteration of
inner loop next time around

Chapter 4 — The Processor — 137


2-Bit Predictor
 Only change prediction on two successive mispredictions

Chapter 4 — The Processor — 138
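A small Python sketch of a 2-bit saturating-counter predictor (states 0-1 predict not taken, 2-3 predict taken; the class and method names are illustrative assumptions):

class TwoBitPredictor:
    def __init__(self):
        self.state = 0                      # start at strongly not taken

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:   # inner-loop style behaviour
    print(p.predict(), outcome)             # only repeated mispredictions flip the prediction
    p.update(outcome)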


Calculating the Branch Target Address
 Even with predictor, still need to calculate the
target address
 1-cycle penalty for a taken branch in 5-stage MIPS
processor
 Branch target buffer (discussed in CA course)
 Cache of target addresses
 Indexed by PC when instruction fetched
 If hit and instruction is branch predicted taken, can fetch
target immediately
 0-cycle penalty

Chapter 4 — The Processor — 139


§4.10 Exceptions
Exceptions and Interrupts
 “Unexpected” events requiring change in flow of control
 Different ISAs use the terms differently
 Exception: Arises within the CPU
 e.g., undefined opcode, overflow, syscall, …
 Interrupt: From an external I/O controller

 Dealing with exceptions without sacrificing performance
is hard
Chapter 4 — The Processor — 140
Handling Exceptions in MIPS
 In MIPS, exceptions managed by a System Control
Coprocessor (CP0)
 1. Save PC of offending (or interrupted) instruction in
Exception Program Counter (EPC)
 2. Save indication of the problem in Cause register
 Must know the reason for the exception
 Cause is a status register
 We’ll assume 1-bit flag for each reason
 3. Save registers in memory (similar to procedure call)
 4. Jump to the exception handler at 8000 0180hex

Chapter 4 — The Processor — 141


An Alternate Mechanism
 Vectored Interrupts
 Handler address determined by the cause
 Example:

 OS knows the reason for the exception by the address


at which it is initiated.

Note: when the exception is not vectored, a single entry point for all
exceptions is used.

Chapter 4 — The Processor — 142


Handler Actions
 Read cause, and transfer to relevant handler
 Determine action required
 If restartable (must subtract 4 from the EPC, since PC + 4 is saved)
 Take corrective action
 Use the EPC to return to the program (also need to restore
the saved registers from memory)
 Otherwise
 Terminate program
 Report error using EPC, cause, …

Chapter 4 — The Processor — 143


Exceptions in a Pipeline
 Another form of control hazard (Similar to mispredicted
branch)
 Use much of the same hardware
 Complete previous instruction (in the pipeline)
 Flush itself and subsequent instructions (in the pipeline)
 Set Cause register and EPC (actually PC+4 is saved)
 Transfer control to handler (similar to procedure call)

 Re-startable exception
 E.g. cache miss
 Refetch then execute the instruction from scratch

Chapter 4 — The Processor — 144


Pipeline with Control to Handle Exceptions
Zeros control signals for flushing
Example: Exception in MIPS Processor
 Given the instruction sequence:
40H sub $11, $2, $4
44H and $12, $2, $5
48H or $13, $2, $6
4CH add $1, $2, $1
50H slt $15, $6, $7
54H lw $16, 50($7)

 Assume the instructions invoked on an exception (the handler) begin like this:
80000180H sw $26, 1000($0)
80000184H sw $27, 1004($0)
 Show what happens in the pipeline if an overflow exception occurs in the add
instruction.

Chapter 4 — The Processor — 146


Exception Example
4Ch + 4 = 50h saved in EPC

Clock 6

Chapter 4 — The Processor — 147


Exception Example
The add and following instructions are flushed

Clock 7

Chapter 4 — The Processor — 148


Multiple Exceptions
 Pipelining overlaps multiple instructions
 Could have multiple exceptions at once
 Simple approach: deal with exception from
earliest instruction
 Flush subsequent instructions
 “Precise” exceptions
 In complex pipelines
 Multiple instructions issued per cycle
 Out-of-order completion
 Maintaining precise exceptions is difficult! (discussed
in CA course)

Chapter 4 — The Processor — 149


Imprecise Exceptions
 Just stop pipeline and save state
 Including exception cause(s)
 Let the handler work out
 Which instruction(s) had exceptions
 Which to complete or flush
 May require “manual” completion
 Simplifies hardware, but more complex handler
software
 Not feasible for complex multiple-issue
out-of-order pipelines

Chapter 4 — The Processor — 150


§4.11 Parallelism via Instructions
Instruction-Level Parallelism (ILP)
 Pipelining: executing multiple instructions in parallel
 To increase ILP
 Deeper pipeline (increase clock rate)
 Less work per stage  shorter clock cycle
 Multiple issue (using multiple ALUs)
 Replicate pipeline stages  multiple pipelined datapaths
 Start multiple instructions per clock cycle
 CPI < 1, so use Instructions Per Cycle (IPC)
 E.g., 4 GHz 4-way multiple-issue (up to 4 parallel instructions)
 16 BIPS, peak CPI = 0.25, peak IPC = 4, ideally
 But dependencies reduce this in practice

Chapter 4 — The Processor — 151


Multiple Issue Processor
 Static multiple issue or VLIW processor
 Compiler solves hazards, groups instructions to be issued
together, and packages them into “issue slots”
 Compiler detects and avoids hazards
 Dynamic multiple issue or Superscalar processor
 CPU examines instruction stream and chooses instructions
to issue each cycle
 Compiler can help by reordering instructions
 CPU resolves hazards using advanced techniques at
runtime
 Rescheduling and loop unrolling techniques for
multiple issue processors

Chapter 4 — The Processor — 152


MIPS with Static Dual Issue
 Two-issue packets
 One for ALU/branch instruction and the other for
load/store instruction
 64-bit aligned, 2-issue slot
 Pad an unused instruction with nop
 Peak IPC = 2
Address Instruction type Pipeline Stages

n ALU/branch IF ID EX MEM WB

n+4 Load/store IF ID EX MEM WB

n+8 ALU/branch IF ID EX MEM WB

n + 12 Load/store IF ID EX MEM WB

n + 16 ALU/branch IF ID EX MEM WB

n + 20 Load/store IF ID EX MEM WB

Chapter 4 — The Processor — 153


Code Rescheduling
 Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop

 After code rescheduling:


 Loop: lw $t0, 0($s1)
addi $s1, $s1, -4
addu $t0, $t0, $s2
sw $t0, 4($s1)
bne $s1, $zero, Loop

Chapter 4 — The Processor — 154


Loop Unrolling
 Replicate loop body to expose more parallelism
 Reduces loop-control overhead
 Use different registers per replication
 Called “register renaming”
 Avoid loop-carried “anti-dependencies”
 Store followed by a load of the same register
 Aka “name dependence”
 Reuse of a register name

Chapter 4 — The Processor — 155


Multiple-Issue Code Scheduling
 2-issue processor (in the schedule, a blank slot is a nop)
 CPI: 4/5 = 0.8 (or IPC = 1.25)
 Assume the loop index is a multiple of four
 After four-times loop unrolling and code scheduling
 CPI: 8/14 = 0.57 (or IPC = 1.75)
 Closer to 2, but at the cost of registers and code size
Chapter 4 — The Processor — 156
Static Multiple Issue Processor
 Compiler must remove some/all hazards
 Reorder instructions into issue packets
 No dependencies with a packet
 Possibly some dependencies between
packets
 Varies between ISAs; compiler must know!
 Insert nop(s), if necessary
 Software complexity increases; hardware complexity decreases

Chapter 4 — The Processor — 157


Two-Issue MIPS VLIW Processor

Chapter 4 — The Processor — 158


Dynamic Multiple Issue Processor
 CPU decides whether to issue 0, 1, 2, … each
cycle (out-of-order execution and completion)
 Avoiding structural and data hazards
 Avoids the need for compiler scheduling
 Though it may still help
 Code semantics ensured by the CPU
 Old code still run
 May not re-compile the code for new version
 Hardware complexity increases; software complexity decreases

Chapter 4 — The Processor — 159


Superscalar Processor
 [Figure: dynamically scheduled pipeline — the in-order issue unit preserves dependencies; reservation stations hold pending operands; results are also sent to any waiting reservation stations; a reorder buffer holds register writes and can supply operands for issued instructions]
Chapter 4 — The Processor — 160


Speculation
 Predict and continue to do with an instruction
 Start operation as soon as possible
 Check whether guess was right
 If so, complete the operation
 If not, roll-back and do the right thing
 Common to static and dynamic multiple issue
 Examples
 Speculate on branch outcome
 Roll back if path taken is different
 Speculate on load
 Roll back if location is updated

Chapter 4 — The Processor — 161


Compiler/Hardware Speculation
 Compiler can reorder instructions
 e.g., move load before branch
 Can include “fix-up” instructions to recover from
incorrect guess
 Hardware can look ahead for instructions to
execute
 Buffer results until it determines they are actually
needed
 Flush buffers on incorrect speculation

Chapter 4 — The Processor — 162


Speculation and Exceptions
 What if exception occurs on a speculatively
executed instruction?
 e.g., speculative load before null-pointer check
 Static speculation
 Can add ISA support for deferring exceptions
 Dynamic speculation
 Can buffer exceptions until instruction completion
(which may not occur)

Chapter 4 — The Processor — 163


Does Multiple Issue Work?
The BIG Picture
 Yes, but not as much as we’d like
 Programs have real dependencies that limit ILP
 Some dependencies are hard to eliminate
 e.g., pointer aliasing
 Some parallelism is hard to expose
 Limited window size during instruction issue
 Memory delays and limited bandwidth
 Hard to keep pipelines full
 Speculation can help if done well

Chapter 4 — The Processor — 164


Multicore/Multiprocessor is the Trend
 Complexity of multiple-issue processors requires power
 Multiple simpler cores may be better
 [Graph: uniprocessor performance (vs. VAX-11/780), 1978-2006; roughly 25%/year early on, then 52%/year with pipelining, data locality, and parallel processing, then slowing (??%/year) in the mid-2000s]
Chapter 4 — The Processor — 165
§4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines
Cortex A8 and Intel i7
Processor ARM A8 Intel Core i7 920
Market Personal Mobile Device Server, cloud
Thermal design power 2 Watts 130 Watts
Clock rate 1 GHz 2.66 GHz
Cores/Chip 1 4
Floating point? No Yes
Multiple issue? Dynamic Dynamic
Peak instructions/clock cycle 2 4
Pipeline stages 14 14
Pipeline schedule Static in-order Dynamic out-of-order
with speculation
Branch prediction 2-level 2-level
1st level caches/core 32 KiB I, 32 KiB D 32 KiB I, 32 KiB D
2nd level caches/core 128-1024 KiB 256 KiB
3rd level caches (shared) - 2-8 MB

Chapter 4 — The Processor — 166


§4.14 Fallacies and Pitfalls
Fallacies
 Pipelining is easy (!)
 The basic idea is easy
 The devil is in the details
 e.g., detecting data hazards
 Pipelining is independent of technology
 So why haven’t we always done pipelining?
 More transistors make more advanced techniques
feasible
 Pipeline-related ISA design needs to take account of
technology trends
 e.g., predicated instructions

Chapter 4 — The Processor — 167


Pitfalls
 Poor ISA design can make pipelining
harder
 e.g., complex instruction sets (VAX, IA-32)
 Significant overhead to make pipelining work
 IA-32 micro-op approach
 e.g., complex addressing modes
 Register update side effects, memory indirection
 e.g., delayed branches
 Advanced pipelines have long delay slots

Chapter 4 — The Processor — 168


§4.15 Concluding Remarks
Concluding Remarks
 ISA influences design of datapath and control
 Datapath and control influence design of ISA
 Pipelining improves instruction throughput
using parallelism
 More instructions completed per second
 Latency for each instruction not reduced
 Hazards: structural, data, control
 Multiple issue and dynamic scheduling (ILP)
 Dependencies limit achievable parallelism
 Complexity leads to the power wall

Chapter 4 — The Processor — 169
