Computer Architecture CS2009: Faculty of Computer Science and Engineering, Department of Computer Engineering, Võ Tấn Phương

Copyright
© Attribution Non-Commercial (BY-NC)

dce

2009

COMPUTER ARCHITECTURE

CS2009
Faculty of Computer Science and Engineering
Department of Computer Engineering
BK TP.HCM

Võ Tấn Phương
http://www.cse.hcmut.edu.vn/~vtphuong/KTMT
©2009, CE Department
Chapter 3

The Processor

Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008

11/17/2009


The Five Classic Components of a Computer

Introduction
• CPU performance factors
– Instruction count
• Determined by ISA and compiler
– CPI and Cycle time
• Determined by CPU hardware
• We will examine two MIPS implementations
– A simplified version
– A more realistic pipelined version
• Simple subset, shows most aspects
– Memory reference: lw, sw
– Arithmetic/logical: add, sub, and, or, slt
– Control transfer: beq, j

Instruction Execution
• PC → instruction memory, fetch instruction
• Register numbers → register file, read registers
• Depending on instruction class
– Use ALU to calculate
• Arithmetic result
• Memory address for load/store
• Branch target address
– Access data memory for load/store
– PC ← target address or PC + 4

CPU Overview

Multiplexers
• Can’t just join
wires together
– Use multiplexers

Control

Logic Design Basics
• Information encoded in binary
– Low voltage = 0, High voltage = 1
– One wire per bit
– Multi-bit data encoded on multi-wire buses
• Combinational element
– Operate on data
– Output is a function of input
• State (sequential) elements
– Store information

Combinational Elements

• AND-gate
– Y = A & B
• Adder
– Y = A + B
• Multiplexer
– Y = S ? I1 : I0
• Arithmetic/Logic Unit
– Y = F(A, B)

Sequential Elements
• Register: stores data in a circuit
– Uses a clock signal to determine when to
update the stored value
– Edge-triggered: update when Clk changes
from 0 to 1

[Figure: register symbol (D, Clk, Q) and timing diagram]

Sequential Elements
• Register with write control
– Only updates on clock edge when write
control input is 1
– Used when stored value is required later

[Figure: register with write control (D, Clk, Write, Q) and timing diagram]

Clocking Methodology
• Combinational logic transforms data
during clock cycles
– Between clock edges
– Input from state elements, output to state
element
– Longest delay determines clock period

Building a Datapath
• Datapath
– Elements that process data and addresses
in the CPU
• Registers, ALUs, mux’s, memories, …
• We will build a MIPS datapath
incrementally
– Refining the overview design

Instruction Fetch

[Figure: instruction fetch datapath. Callouts: 32-bit register; increment by 4 for next instruction]

Review Instruction Formats

R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result

Load/Store Instructions
• Read register operands
• Calculate address using 16-bit offset
– Use ALU, but sign-extend offset
• Load: Read memory and update register
• Store: Write register value to memory

Branch Instructions
• Read register operands
• Compare operands
– Use ALU, subtract and check Zero output
• Calculate target address
– Sign-extend displacement
– Shift left 2 places (word displacement)
– Add to PC + 4
• Already calculated by instruction fetch
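
The target calculation above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the function names `sign_extend16` and `branch_target` are hypothetical, and addresses are assumed to be 32-bit byte addresses.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """Target = (PC + 4) + (sign-extended word displacement << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


# A beq at address 40 with a word displacement of 7: 44 + 28 = 72.
print(branch_target(40, 7))
# A displacement of -1 (encoded as 0xFFFF) branches back to the beq itself.
print(branch_target(40, 0xFFFF))
```

The shift by 2 converts the word displacement into a byte displacement, which is why the hardware can implement it by re-routing wires.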

Branch Instructions
[Figure: branch datapath. Callouts: shift left 2 just re-routes wires; sign-bit wire replicated]

Composing the Elements
• First-cut data path does an instruction in
one clock cycle
– Each datapath element can only do one
function at a time
– Hence, we need separate instruction and data
memories
• Use multiplexers where alternate data
sources are used for different instructions

R-Type/Load/Store Datapath

Full Datapath

ALU Control
• ALU used for
– Load/Store: F = add
– Branch: F = subtract
– R-type: F depends on funct field

ALU control   Function
0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR

ALU Control
• Assume 2-bit ALUOp derived from opcode
– Combinational logic derives ALU control

opcode   ALUOp   Operation          funct    ALU function       ALU control
lw       00      load word          XXXXXX   add                0010
sw       00      store word         XXXXXX   add                0010
beq      01      branch equal       XXXXXX   subtract           0110
R-type   10      add                100000   add                0010
R-type   10      subtract           100010   subtract           0110
R-type   10      AND                100100   AND                0000
R-type   10      OR                 100101   OR                 0001
R-type   10      set-on-less-than   101010   set-on-less-than   0111
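
The table above can be read as a small combinational function. The following Python sketch mirrors it row by row; the names `FUNCT_TO_CONTROL` and `alu_control` are hypothetical, and real hardware would implement this as gates, not as a lookup.

```python
# funct field -> 4-bit ALU control, for the R-type rows of the table
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:   # lw / sw: add to form the address, funct is a don't-care
        return 0b0010
    if alu_op == 0b01:   # beq: subtract to compare, funct is a don't-care
        return 0b0110
    if alu_op == 0b10:   # R-type: operation selected by funct
        return FUNCT_TO_CONTROL[funct]
    raise ValueError("ALUOp value not used in this table")


print(format(alu_control(0b10, 0b101010), "04b"))  # set-on-less-than
```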

The Main Control Unit
• Control signals derived from instruction

R-type       0          rs      rt      rd      shamt   funct
             31:26      25:21   20:16   15:11   10:6    5:0

Load/Store   35 or 43   rs      rt      address
             31:26      25:21   20:16   15:0

Branch       4          rs      rt      address
             31:26      25:21   20:16   15:0

Callouts: opcode; always read; read, except for load; write for R-type and load; sign-extend and add

Datapath With Control

R-Type Instruction

Load Instruction

Branch-on-Equal Instruction

Implementing Jumps
Jump   2       address
       31:26   25:0

• Jump uses word address
• Update PC with concatenation of
– Top 4 bits of old PC
– 26-bit jump address
– 00
• Need an extra control signal decoded from
opcode

Datapath With Jumps Added

Performance Issues
• Longest delay determines clock period
– Critical path: load instruction
– Instruction memory → register file → ALU →
data memory → register file
• Not feasible to vary period for different
instructions
• Violates design principle
– Making the common case fast
• We will improve performance by pipelining

Pipelining Analogy
• Pipelined laundry: overlapping execution
– Parallelism improves performance

• Four loads:
– Speedup = 8/3.5 = 2.3
• Non-stop:
– Speedup = 2n/(0.5n + 1.5) ≈ 4
= number of stages
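
The two speedup figures can be reproduced with a short calculation. The sketch below assumes four laundry stages of 0.5 hours each, which is what the 2n and 0.5n + 1.5 terms imply; the function name `laundry_speedup` is hypothetical.

```python
def laundry_speedup(n_loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Speedup of pipelined over sequential laundry for n_loads loads."""
    sequential = n_loads * stages * stage_hours                     # 2n hours
    pipelined = (stages - 1) * stage_hours + n_loads * stage_hours  # 1.5 + 0.5n hours
    return sequential / pipelined


print(round(laundry_speedup(4), 1))      # four loads: 8 / 3.5
print(round(laundry_speedup(10**6), 3))  # approaches the number of stages
```

The fixed 1.5-hour term is the time to fill the pipeline; it becomes negligible as the number of loads grows.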

MIPS Pipeline
• Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

Pipeline Performance
• Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle
datapath

Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200ps         100ps           200ps    200ps           100ps            800ps
sw         200ps         100ps           200ps    200ps           -                700ps
R-format   200ps         100ps           200ps    -               100ps            600ps
beq        200ps         100ps           200ps    -               -                500ps
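
The clock periods that follow from this table can be checked with a small script. This is a sketch under the stage times assumed above; the dictionary names are hypothetical. The single-cycle clock must fit the slowest instruction, while the pipelined clock only has to fit the slowest stage.

```python
# Stage latencies in picoseconds, as assumed on the slide
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}

totals = {instr: sum(STAGE_PS[s] for s in stages) for instr, stages in USES.items()}
single_cycle_tc = max(totals.values())  # set by the slowest instruction (lw)
pipelined_tc = max(STAGE_PS.values())   # set by the slowest stage

print(totals)
print(single_cycle_tc, pipelined_tc)
```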

Pipeline Performance
Single-cycle (Tc = 800ps)

Pipelined (Tc = 200ps)

Pipeline Speedup
• If all stages are balanced
– i.e., all take the same time
– $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
• If not balanced, speedup is less
• Speedup due to increased throughput
– Latency (time for each instruction) does not
decrease

Pipelining and ISA Design
• MIPS ISA designed for pipelining
– All instructions are 32 bits
• Easier to fetch and decode in one cycle
• c.f. x86: 1- to 17-byte instructions
– Few and regular instruction formats
• Can decode and read registers in one step
– Load/store addressing
• Can calculate address in 3rd stage, access
memory in 4th stage
– Alignment of memory operands
• Memory access takes only one cycle

Hazards
• Situations that prevent starting the next
instruction in the next cycle
• Structure hazards
– A required resource is busy
• Data hazard
– Need to wait for previous instruction to
complete its data read/write
• Control hazard
– Deciding on control action depends on
previous instruction

Structure Hazards
• Conflict for use of a resource
• In MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that
cycle
• Would cause a pipeline “bubble”
• Hence, pipelined datapaths require
separate instruction/data memories
– Or separate instruction/data caches

Data Hazards
• An instruction depends on completion of
data access by a previous instruction
– add $s0, $t0, $t1
sub $t2, $s0, $t3

Forwarding (aka Bypassing)
• Use result when it is computed
– Don’t wait for it to be stored in a register
– Requires extra connections in the datapath

Load-Use Data Hazard
• Can’t always avoid stalls by forwarding
– If value not computed when needed
– Can’t forward backward in time!

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in
the next instruction
• C code for A = B + E; C = B + F;

Original order:
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2 (stall)
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4 (stall)
sw $t5, 16($t0)
13 cycles

Reordered:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
11 cycles
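
The two cycle counts can be reproduced with a simple model. The sketch below assumes a 5-stage pipeline with full forwarding, so the only stall is one cycle when an instruction uses the result of the load immediately before it; the tuple encoding and the function name `cycles` are hypothetical.

```python
def cycles(program):
    """Count cycles for a list of (op, dest, sources) on a 5-stage pipeline.

    Assumes full forwarding: the only stall is one bubble for a load-use pair.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in cur[2]:
            stalls += 1
    # n instructions need n + 4 cycles to drain five stages, plus any stalls
    return len(program) + 4 + stalls


original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]
# Same instructions with the third load hoisted above the first add
reordered = [original[0], original[1], original[4], original[2],
             original[3], original[5], original[6]]

print(cycles(original), cycles(reordered))
```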

Control Hazards
• Branch determines flow of control
– Fetching next instruction depends on branch
outcome
– Pipeline can’t always fetch correct instruction
• Still working on ID stage of branch
• In MIPS pipeline
– Need to compare registers and compute
target early in the pipeline
– Add hardware to do it in ID stage

Stall on Branch
• Wait until branch outcome determined
before fetching next instruction

Branch Prediction
• Longer pipelines can’t readily determine
branch outcome early
– Stall penalty becomes unacceptable
• Predict outcome of branch
– Only stall if prediction is wrong
• In MIPS pipeline
– Can predict branches not taken
– Fetch instruction after branch, with no delay

MIPS with Predict Not Taken

Prediction
correct

Prediction
incorrect

More-Realistic Branch Prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken
• Dynamic branch prediction
– Hardware measures actual branch behavior
• e.g., record recent history of each branch
– Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history

Pipeline Summary
The BIG Picture

• Pipelining improves performance by
increasing instruction throughput
– Executes multiple instructions in parallel
– Each instruction has the same latency
• Subject to hazards
– Structure, data, control
• Instruction set design affects complexity of
pipeline implementation

MIPS Pipelined Datapath

[Figure: pipelined datapath. Callout: right-to-left flow (MEM, WB) leads to hazards]

Pipeline registers
• Need registers between stages
– To hold information produced in previous cycle

Pipeline Operation
• Cycle-by-cycle flow of instructions through
the pipelined datapath
– “Single-clock-cycle” pipeline diagram
• Shows pipeline usage in a single cycle
• Highlight resources used
– cf. “multi-clock-cycle” diagram
• Graph of operation over time
• We’ll look at “single-clock-cycle” diagrams
for load & store

IF for Load, Store, …

ID for Load, Store, …

EX for Load

MEM for Load

WB for Load

[Figure callout: wrong register number]

Corrected Datapath for Load

EX for Store

MEM for Store

WB for Store

Multi-Cycle Pipeline Diagram
• Form showing resource usage

Multi-Cycle Pipeline Diagram
• Traditional form

Single-Cycle Pipeline Diagram
• State of pipeline in a given cycle

Pipelined Control (Simplified)

Pipelined Control
• Control signals derived from instruction
– As in single-cycle implementation

Pipelined Control

Data Hazards in ALU Instructions
(§4.7 Data Hazards: Forwarding vs. Stalling)
• Consider this sequence:
sub $2, $1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)
• We can resolve hazards with forwarding
– How do we detect when to forward?

Dependencies & Forwarding

Detecting the Need to Forward

• Pass register numbers along pipeline
– e.g., ID/EX.RegisterRs = register number for Rs
sitting in ID/EX pipeline register
• ALU operand register numbers in EX stage
are given by
– ID/EX.RegisterRs, ID/EX.RegisterRt
• Data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs (fwd from EX/MEM pipeline reg)
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt (fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs (fwd from MEM/WB pipeline reg)
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt (fwd from MEM/WB pipeline reg)

Detecting the Need to Forward
• But only if forwarding instruction will write
to a register!
– EX/MEM.RegWrite, MEM/WB.RegWrite
• And only if Rd for that instruction is not
$zero
– EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

Forwarding Paths

Forwarding Conditions
• EX hazard
– if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
– if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
• MEM hazard
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

Double Data Hazard
• Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
• Both hazards occur
– Want to use the most recent
• Revise MEM hazard condition
– Only fwd if EX hazard condition isn’t true


Revised Forwarding Condition


• MEM hazard
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
– if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
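
The EX hazard test combined with the revised MEM hazard test can be sketched as one priority check per ALU operand. The function name `forward_select` and its parameter names are hypothetical, and the value "00" for "no forwarding, use the register file value" is an assumption that the slides do not state explicitly.

```python
def forward_select(ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int,
                   id_ex_reg: int) -> str:
    """Return the forwarding mux select for one ALU operand (Rs or Rt)."""
    # EX hazard: the most recent result sits in the EX/MEM pipeline register
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "10"
    # MEM hazard: only reached when the EX hazard condition is not true
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "01"
    return "00"  # assumed encoding for "no forwarding"


# Double data hazard: both earlier adds write $1, so the EX/MEM copy must win.
print(forward_select(True, 1, True, 1, 1))
```

Checking the EX hazard first gives the same result as the explicit "and not (EX hazard)" term in the revised MEM condition.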

Datapath with Forwarding

Load-Use Data Hazard

Need to stall
for one cycle

Load-Use Hazard Detection
• Check when using instruction is decoded
in ID stage
• ALU operand register numbers in ID stage
are given by
– IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
– ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert bubble
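
The detection condition above translates directly into a boolean function. This is a minimal sketch; the function and parameter names are hypothetical stand-ins for the pipeline register fields.

```python
def load_use_hazard(id_ex_mem_read: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the result of a load now in EX."""
    return bool(id_ex_mem_read and
                (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# Illustrative case: a load into $2 followed immediately by an instruction reading $2.
print(load_use_hazard(True, 2, 2, 5))
# Same registers, but the instruction in EX is not a load: forwarding suffices.
print(load_use_hazard(False, 2, 2, 5))
```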

How to Stall the Pipeline
• Force control values in ID/EX register
to 0
– EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
– Using instruction is decoded again
– Following instruction is fetched again
– 1-cycle stall allows MEM to read data for lw
• Can subsequently forward to EX stage

Stall/Bubble in the Pipeline

Stall inserted
here

Stall/Bubble in the Pipeline

Or, more accurately…

Datapath with Hazard Detection

Stalls and Performance
The BIG Picture

• Stalls reduce performance


– But are required to get correct results
• Compiler can arrange code to avoid
hazards and stalls
– Requires knowledge of the pipeline structure

Branch Hazards
• If branch outcome determined in MEM

[Figure: pipeline diagram. Callouts: flush these instructions (set control values to 0); PC]

Reducing Branch Delay
• Move hardware to determine outcome to ID stage
– Target address adder
– Register comparator
• Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw $4, 50($7)

Example: Branch Taken

Example: Branch Taken

Data Hazards for Branches
• If a comparison register is a destination of
2nd or 3rd preceding ALU instruction

add $1, $2, $3 IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

… IF ID EX MEM WB

beq $1, $4, target IF ID EX MEM WB

• Can resolve using forwarding

Data Hazards for Branches
• If a comparison register is a destination of
preceding ALU instruction or 2nd preceding
load instruction
– Need 1 stall cycle

lw $1, addr IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

beq stalled IF ID

beq $1, $4, target ID EX MEM WB

Data Hazards for Branches
• If a comparison register is a destination of
immediately preceding load instruction
– Need 2 stall cycles

lw $1, addr IF ID EX MEM WB

beq stalled IF ID

beq stalled ID

beq $1, $0, target ID EX MEM WB

Dynamic Branch Prediction
• In deeper and superscalar pipelines, branch
penalty is more significant
• Use dynamic prediction
– Branch prediction buffer (aka branch history table)
– Indexed by recent branch instruction addresses
– Stores outcome (taken/not taken)
– To execute a branch
• Check table, expect the same outcome
• Start fetching from fall-through or target
• If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!
outer: …

inner: …

beq …, …, inner

beq …, …, outer

– Mispredict as taken on last iteration of
inner loop
– Then mispredict as not taken on first
iteration of inner loop next time around

2-Bit Predictor
• Only change prediction on two successive
mispredictions
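
The difference between the two schemes can be seen by simulating the inner-loop branch. The sketch below is one common way to realize a 2-bit predictor, as a saturating counter, and is not necessarily the exact state machine in the lost figure; the function names, the initial states, and the loop sizes (10 inner iterations, 3 outer iterations) are assumptions chosen for illustration.

```python
def mispredictions_1bit(outcomes, predict_taken=True):
    """1-bit predictor: always predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if predict_taken != taken:
            misses += 1
        predict_taken = taken
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: predict taken when counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Inner-loop branch: taken 9 times, then not taken on exit; outer loop runs 3 times.
outcomes = ([True] * 9 + [False]) * 3
print(mispredictions_1bit(outcomes), mispredictions_2bit(outcomes))
```

In this run the 1-bit scheme misses on every loop exit and again on every re-entry after the first pass, while the 2-bit scheme misses only on the exits.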

Calculating the Branch Target
• Even with predictor, still need to calculate
the target address
– 1-cycle penalty for a taken branch
• Branch target buffer
– Cache of target addresses
– Indexed by PC when instruction fetched
• If hit and instruction is branch predicted taken, can
fetch target immediately

Exceptions and Interrupts
• “Unexpected” events requiring change
in flow of control
– Different ISAs use the terms differently
• Exception
– Arises within the CPU
• e.g., undefined opcode, overflow, syscall, …
• Interrupt
– From an external I/O controller
• Dealing with them without sacrificing
performance is hard

Handling Exceptions
• In MIPS, exceptions managed by a System
Control Coprocessor (CP0)
• Save PC of offending (or interrupted) instruction
– In MIPS: Exception Program Counter (EPC)
• Save indication of the problem
– In MIPS: Cause register
– We’ll assume 1-bit
• 0 for undefined opcode, 1 for overflow
• Jump to handler at 8000 0180

An Alternate Mechanism
• Vectored Interrupts
– Handler address determined by the cause
• Example:
– Undefined opcode: C000 0000
– Overflow: C000 0020
– …: C000 0040
• Instructions either
– Deal with the interrupt, or
– Jump to real handler

Handler Actions
• Read cause, and transfer to relevant
handler
• Determine action required
• If restartable
– Take corrective action
– Use EPC to return to program
• Otherwise
– Terminate program
– Report error using EPC, cause, …

Exceptions in a Pipeline
• Another form of control hazard
• Consider overflow on add in EX stage
add $1, $2, $1
– Prevent $1 from being clobbered
– Complete previous instructions
– Flush add and subsequent instructions
– Set Cause and EPC register values
– Transfer control to handler
• Similar to mispredicted branch
– Use much of the same hardware

Pipeline with Exceptions

Exception Properties
• Restartable exceptions
– Pipeline can flush the instruction
– Handler executes, then returns to the
instruction
• Refetched and executed from scratch
• PC saved in EPC register
– Identifies causing instruction
– Actually PC + 4 is saved
• Handler must adjust

Exception Example
• Exception on add in
40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)

• Handler
80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)

Exception Example

Exception Example

Multiple Exceptions
• Pipelining overlaps multiple instructions
– Could have multiple exceptions at once
• Simple approach: deal with exception from
earliest instruction
– Flush subsequent instructions
– “Precise” exceptions
• In complex pipelines
– Multiple instructions issued per cycle
– Out-of-order completion
– Maintaining precise exceptions is difficult!

Imprecise Exceptions
• Just stop pipeline and save state
– Including exception cause(s)
• Let the handler work out
– Which instruction(s) had exceptions
– Which to complete or flush
• May require “manual” completion
• Simplifies hardware, but more complex handler
software
• Not feasible for complex multiple-issue
out-of-order pipelines

Instruction-Level Parallelism (ILP)
• Pipelining: executing multiple instructions in
parallel
• To increase ILP
– Deeper pipeline
• Less work per stage ⇒ shorter clock cycle
– Multiple issue
• Replicate pipeline stages ⇒ multiple pipelines
• Start multiple instructions per clock cycle
• CPI < 1, so use Instructions Per Cycle (IPC)
• E.g., 4GHz 4-way multiple-issue
– 16 BIPS, peak CPI = 0.25, peak IPC = 4
• But dependencies reduce this in practice
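
The peak figures in the example follow from two multiplications. A small sketch, with the hypothetical helper `peak_rates`:

```python
def peak_rates(clock_hz: float, issue_width: int):
    """Return (peak BIPS, peak CPI, peak IPC) for a multiple-issue processor."""
    instructions_per_second = clock_hz * issue_width
    return instructions_per_second / 1e9, 1 / issue_width, issue_width


# 4GHz, 4-way multiple issue
print(peak_rates(4e9, 4))
```

These are upper bounds: they assume every issue slot is filled on every cycle.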

Multiple Issue
• Static multiple issue
– Compiler groups instructions to be issued together
– Packages them into “issue slots”
– Compiler detects and avoids hazards
• Dynamic multiple issue
– CPU examines instruction stream and chooses
instructions to issue each cycle
– Compiler can help by reordering instructions
– CPU resolves hazards using advanced techniques at
runtime

Speculation
• “Guess” what to do with an instruction
– Start operation as soon as possible
– Check whether guess was right
• If so, complete the operation
• If not, roll-back and do the right thing
• Common to static and dynamic multiple issue
• Examples
E l
– Speculate on branch outcome
• Roll back if path taken is different
– Speculate on load
• Roll back if location is updated

Compiler/Hardware Speculation
• Compiler can reorder instructions
– e.g., move load before branch
– Can include “fix-up” instructions to recover
from incorrect guess
• Hardware can look ahead for instructions
to execute
– Buffer results until it determines they are
actually needed
– Flush buffers on incorrect speculation

Speculation and Exceptions
• What if exception occurs on a
speculatively executed instruction?
– e.g., speculative load before null-pointer
check
• Static speculation
– Can add ISA support for deferring exceptions
• Dynamic speculation
– Can buffer exceptions until instruction
completion (which may not occur)

Static Multiple Issue
• Compiler groups instructions into “issue
packets”
– Group of instructions that can be issued on a
single cycle
– Determined by pipeline resources required
• Think of an issue packet as a very long
instruction
– Specifies multiple concurrent operations
– ⇒ Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue
• Compiler must remove some/all hazards
– Reorder instructions into issue packets
– No dependencies within a packet
– Possibly some dependencies between
packets
• Varies between ISAs; compiler must know!
– Pad with nop if necessary

MIPS with Static Dual Issue
• Two-issue packets
– One ALU/branch instruction
– One load/store instruction
– 64-bit aligned
• ALU/branch, then load/store
• Pad an unused instruction with nop

Address   Instruction type   Pipeline Stages
n         ALU/branch         IF  ID  EX  MEM  WB
n + 4     Load/store         IF  ID  EX  MEM  WB
n + 8     ALU/branch             IF  ID  EX  MEM  WB
n + 12    Load/store             IF  ID  EX  MEM  WB
n + 16    ALU/branch                 IF  ID  EX  MEM  WB
n + 20    Load/store                 IF  ID  EX  MEM  WB

MIPS with Static Dual Issue

Hazards in the Dual-Issue MIPS
• More instructions executing in parallel
• EX data hazard
– Forwarding avoided stalls with single-issue
– Now can’t use ALU result in load/store in same
packet
• add $t0, $s0, $s1
load $s2, 0($t0)
• Split into two packets, effectively a stall
• Load-use hazard
– Still one cycle use latency, but now two instructions
• More aggressive scheduling required

Scheduling Example
• Schedule this for dual-issue MIPS
Loop: lw $t0, 0($s1) # $t0=array element
addu $t0, $t0, $s2 # add scalar in $s2
sw $t0, 0($s1) # store result
addi $s1, $s1,–4 # decrement pointer
bne $s1, $zero, Loop # branch $s1!=0

        ALU/branch              Load/store        cycle
Loop:   nop                     lw $t0, 0($s1)    1
        addi $s1, $s1,–4        nop               2
        addu $t0, $t0, $s2      nop               3
        bne $s1, $zero, Loop    sw $t0, 4($s1)    4

– IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Loop Unrolling
• Replicate loop body to expose more
parallelism
– Reduces loop-control overhead
• Use different registers per replication
– Called “register renaming”
– Avoid loop-carried “anti-dependencies”
• Store followed by a load of the same register
• Aka “name dependence”
– Reuse of a register name

Loop Unrolling Example
        ALU/branch              Load/store         cycle
Loop:   addi $s1, $s1,–16       lw $t0, 0($s1)     1
        nop                     lw $t1, 12($s1)    2
        addu $t0, $t0, $s2      lw $t2, 8($s1)     3
        addu $t1, $t1, $s2      lw $t3, 4($s1)     4
        addu $t2, $t2, $s2      sw $t0, 16($s1)    5
        addu $t3, $t3, $s2      sw $t1, 12($s1)    6
        nop                     sw $t2, 8($s1)     7
        bne $s1, $zero, Loop    sw $t3, 4($s1)     8

• IPC = 14/8 = 1.75


– Closer to 2, but at cost of registers and code size
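
The IPC figures for the dual-issue schedules can be recomputed by counting the useful (non-nop) slots. In the sketch below each cycle is a pair (ALU/branch slot, load/store slot); the helper name `ipc` and the list encoding are hypothetical.

```python
def ipc(schedule):
    """Useful instructions per cycle for a list of (alu_branch, load_store) packets."""
    useful = sum(1 for packet in schedule for slot in packet if slot != "nop")
    return useful / len(schedule)


# Schedule of the original loop body (4 cycles)
simple = [("nop", "lw"), ("addi", "nop"), ("addu", "nop"), ("bne", "sw")]

# Schedule of the loop unrolled four times (8 cycles)
unrolled = [
    ("addi", "lw"), ("nop", "lw"), ("addu", "lw"), ("addu", "lw"),
    ("addu", "sw"), ("addu", "sw"), ("nop", "sw"), ("bne", "sw"),
]

print(ipc(simple), ipc(unrolled))
```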

Dynamic Multiple Issue
• “Superscalar” processors
• CPU decides whether to issue 0, 1, 2, …
each cycle
– Avoiding structural and data hazards
• Avoids the need for compiler scheduling
– Though it may still help
– Code semantics ensured by the CPU

Dynamic Pipeline Scheduling
• Allow the CPU to execute instructions out
of order to avoid stalls
– But commit result to registers in order
• Example
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
– Can start sub while addu is waiting for lw

Dynamically Scheduled CPU
[Figure: dynamically scheduled CPU. Callouts: preserves dependencies; hold pending operands; results also sent to any waiting reservation stations; reorder buffer for register writes; can supply operands for issued instructions]

Register Renaming
• Reservation stations and reorder buffer
effectively provide register renaming
• On instruction issue to reservation station
– If operand is available in register file or
reorder buffer
• Copied to reservation station
• No longer required in the register; can be
overwritten
– If operand is not yet available
• It will be provided to the reservation station by a
function unit
• Register update may not be required

Speculation
• Predict branch and continue issuing
– Don’t commit until branch outcome
determined
• Load speculation
– Avoid load and cache miss delay
• Predict the effective address
• Predict loaded value
• Load before completing outstanding stores
• Bypass stored values to load unit
– Don’t commit load until speculation cleared

Why Do Dynamic Scheduling?
• Why not just let the compiler schedule
code?
• Not all stalls are predictable
– e.g., cache misses
• Can’t always schedule around branches
– Branch outcome is dynamically determined
• Different implementations of an ISA have
different latencies and hazards

Does Multiple Issue Work?
The BIG Picture

• Yes, but not as much as we’d like


• Programs have real dependencies that limit ILP
• Some dependencies are hard to eliminate
– e.g., pointer aliasing
• Some parallelism is hard to expose
– Limited window size during instruction issue
• Memory delays and limited bandwidth
– Hard to keep pipelines full
• Speculation can help if done well

Power Efficiency
• Complexity of dynamic scheduling and
speculation requires power
• Multiple simpler cores may be better
Microprocessor   Year   Clock Rate   Pipeline Stages   Issue width   Out-of-order/Speculation   Cores   Power
i486             1989   25MHz        5                 1             No                         1       5W
Pentium          1993   66MHz        5                 2             No                         1       10W
Pentium Pro      1997   200MHz       10                3             Yes                        1       29W
P4 Willamette    2001   2000MHz      22                3             Yes                        1       75W
P4 Prescott      2004   3600MHz      31                3             Yes                        1       103W
Core             2006   2930MHz      14                4             Yes                        2       75W
UltraSparc III   2003   1950MHz      14                4             No                         1       90W
UltraSparc T1    2005   1200MHz      6                 1             No                         8       70W

The Opteron X4 Microarchitecture

72 physical
registers

The Opteron X4 Pipeline Flow
• For integer operations

– FP is 5 stages longer
– Up to 106 RISC-ops in progress
• Bottlenecks
– Complex instructions with long dependencies
– Branch mispredictions
– Memory access delays

Fallacies
• Pipelining is easy (!)
– The basic idea is easy
– The devil is in the details
• e.g., detecting data hazards
• Pipelining is independent of technology
– So why haven’t we always done pipelining?
– More transistors make more advanced techniques
feasible
– Pipeline-related ISA design needs to take account of
technology trends
• e.g., predicated instructions

Pitfalls
• Poor ISA design can make pipelining
harder
– e.g., complex instruction sets (VAX, IA-32)
• Significant overhead to make pipelining work
• IA-32 micro-op approach
– e.g., complex addressing modes
• Register update side effects, memory indirection
– e.g., delayed branches
• Advanced pipelines have long delay slots

Concluding Remarks
• ISA influences design of datapath and control
• Datapath and control influence design of ISA
• Pipelining improves instruction throughput
using parallelism
– More instructions completed per second
– Latency for each instruction not reduced
• Hazards: structural, data, control
• Multiple issue and dynamic scheduling (ILP)
– Dependencies limit achievable parallelism
– Complexity leads to the power wall
