CS 211: Computer Architecture: Instructor: Prof. Bhagi Narahari
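Time for a k-stage pipeline with clock period $\tau$ to process n tasks:
$$T_k = \tau\,[k + (n-1)]$$
For the non-pipelined processor:
$$T_1 = n k \tau$$
Speedup factor:
$$S_k = \frac{T_1}{T_k} = \frac{n k \tau}{\tau\,[k + (n-1)]} = \frac{n k}{k + (n-1)}$$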
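A minimal sketch of the speedup factor, assuming a k-stage pipeline with clock period tau that processes n tasks, so the pipelined time is tau * (k + n - 1) and the non-pipelined time is n * k * tau. The function names are illustrative and not part of the course material.

```python
def pipelined_time(k: int, n: int, tau: float) -> float:
    """k cycles to fill the pipeline, then one task completes per cycle."""
    return tau * (k + (n - 1))


def nonpipelined_time(k: int, n: int, tau: float) -> float:
    """Each task takes all k stage times back to back."""
    return n * k * tau


def speedup(k: int, n: int) -> float:
    """S_k = T_1 / T_k; tau cancels out."""
    return (n * k) / (k + (n - 1))


# Speedup for a 5-stage pipeline as the number of tasks grows.
for n in (1, 10, 100, 10_000):
    print(n, round(speedup(5, n), 3))
```

For a single task the speedup is 1, and as n grows it approaches the number of stages k without reaching it.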
Efficiency and Throughput
Efficiency of the k-stage pipeline:
$$E_k = \frac{S_k}{k} = \frac{n}{k + (n-1)}$$
Pipeline throughput (the number of tasks per unit time); note equivalence to IPC:
$$H_k = \frac{n}{\tau\,[k + (n-1)]} = \frac{n f}{k + (n-1)}$$
where $f = 1/\tau$ is the clock frequency.
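The efficiency and throughput formulas can be evaluated the same way. This is a sketch under the assumption that k is the number of stages, n the number of tasks, and f the clock frequency (1 / clock period); the function names are hypothetical.

```python
def efficiency(k: int, n: int) -> float:
    """E_k = S_k / k = n / (k + (n - 1)); approaches 1 for large n."""
    return n / (k + (n - 1))


def throughput(k: int, n: int, f: float) -> float:
    """H_k = n * f / (k + (n - 1)); tasks per unit time, in the units of f."""
    return n * f / (k + (n - 1))


# 4 stages, 1000 tasks, 100 ns clock period (f = 0.01 cycles per ns).
print(efficiency(4, 1000))        # fraction of stage-time slots doing work
print(throughput(4, 1000, 0.01))  # tasks per nanosecond
```

With f set to 1 the throughput is in tasks per cycle, which is the sense in which it is equivalent to IPC.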
Pipeline Performance: Example
Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns (nanoseconds)
Latch delay = 10 ns
Pipeline cycle time = 90+10 = 100 ns
For non-pipelined execution, time = 60+50+90+80 = 280 ns
Speedup for above case is: 280/100 = 2.8 !!
Pipeline time for 1000 tasks = (1000 + 4 - 1) x 100 ns = 1003 x 100 ns
Sequential time = 1000 x 280 ns
Throughput = 1000/1003 tasks per cycle
What is the problem here?
How to improve performance?
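The arithmetic of this example can be reproduced with a short script. It is a sketch that assumes the cycle time is set by the slowest subtask plus the latch delay; the variable names are illustrative.

```python
subtask_ns = [60, 50, 90, 80]  # t1..t4 from the example
latch_ns = 10
n_tasks = 1000

k = len(subtask_ns)
cycle_ns = max(subtask_ns) + latch_ns                 # slowest stage + latch
sequential_task_ns = sum(subtask_ns)                  # one task, no pipeline
pipelined_total_ns = (n_tasks + k - 1) * cycle_ns     # fill, then 1 per cycle
sequential_total_ns = n_tasks * sequential_task_ns
steady_state_speedup = sequential_task_ns / cycle_ns
overall_speedup = sequential_total_ns / pipelined_total_ns
throughput_per_cycle = n_tasks / (n_tasks + k - 1)

# Time each stage sits idle in every cycle because it is faster than the slowest stage.
idle_ns = [max(subtask_ns) - t for t in subtask_ns]

print(cycle_ns, sequential_task_ns, pipelined_total_ns, sequential_total_ns)
print(steady_state_speedup, round(overall_speedup, 3), round(throughput_per_cycle, 4))
print(idle_ns)
```

The idle times suggest one reading of the question the slide poses: the stages are unbalanced, so the 90 ns subtask sets the clock and a 4-stage pipeline delivers a speedup of only about 2.8.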
Non-linear pipelines and pipeline control algorithms
[Figure: datapath with adder (Next SEQ PC), Next PC MUX, instruction memory (Address), register file (RS1, RS2, RD), sign extend (Imm), Zero? test, MUXes, data memory, and WB Data.]
What do we need to do to pipeline the process?
5 Steps of MIPS/DLX Datapath
Instruction Fetch
Instr. Decode / Reg. Fetch
Execute / Addr. Calc
Memory Access
Write Back
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; components: adder (Next SEQ PC), Next PC MUX, instruction memory (Address), register file (RS1, RS2, RD), sign extend (Imm), Zero? test, ALU, MUXes, data memory, WB Data.]
Data stationary control
  - local decode for each instruction phase / pipeline stage
Graphically Representing Pipelines
[Figure: pipeline diagram; successive instructions each pass through Ifetch, Reg, ALU, DMem, Reg.]
Exec:
ALU operates on the two register operands
Update PC
Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per inst}$$
For simple RISC pipeline, CPI = 1:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$
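A minimal sketch of the two relations used here: execution time as instruction count x CPI x cycle time, and pipeline speedup for an ideal CPI of 1. Function and parameter names are illustrative assumptions.

```python
def cpu_time_ns(inst_count: int, cpi: float, cycle_ns: float) -> float:
    """Execution time = instruction count x cycles per instruction x cycle time."""
    return inst_count * cpi * cycle_ns


def pipeline_speedup(depth: int, stall_cpi: float,
                     cycle_unpipelined: float = 1.0,
                     cycle_pipelined: float = 1.0) -> float:
    """Speedup over the unpipelined machine, assuming ideal CPI = 1."""
    return depth / (1.0 + stall_cpi) * (cycle_unpipelined / cycle_pipelined)


# Multicycle machine from the slide: 10 ns/cycle, 4.6 CPI, 100 instructions.
print(cpu_time_ns(100, 4.6, 10))

# A 5-stage pipeline (depth chosen for illustration) with no stalls,
# then with 0.25 stall cycles per instruction.
print(pipeline_speedup(5, 0.0))
print(pipeline_speedup(5, 0.25))
```

With no stalls and equal cycle times the speedup equals the pipeline depth; every stall cycle per instruction eats directly into it.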
One Memory Port/Structural Hazards
[Figure: pipeline diagram, time (clock cycles) across and instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through Ifetch, Reg, ALU, DMem, Reg.]
[Figure: the same diagram with a Stall (a row of bubbles) inserted after Instr 2, delaying Instr 3.]
Example: Dual-port vs. Single-port
CPU time = IC * CPI * Clk
  - CPI = ideal CPI + stalls
Example
Machine A: Dual ported memory ("Harvard Architecture")
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
$$\text{SpeedUp}_A = \frac{\text{Pipe. Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$
$$\text{SpeedUp}_B = \frac{\text{Pipe. Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipe. Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipe. Depth}$$
$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipe. Depth}}{0.75 \times \text{Pipe. Depth}} = 1.33$$
Machine A is 1.33 times faster
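The comparison can be checked numerically. The sketch below assumes a pipeline depth of 5 purely for illustration (the depth cancels in the ratio) and charges each load on the single-ported machine one stall cycle, as in the example; the names are hypothetical.

```python
def speedup(depth: int, stall_cpi: float, clock_ratio: float) -> float:
    """Pipeline speedup with ideal CPI = 1; clock_ratio = clock_unpipe / clock_pipe."""
    return depth / (1.0 + stall_cpi) * clock_ratio


DEPTH = 5              # assumed for illustration; it cancels in the ratio
LOAD_FRACTION = 0.40   # loads are 40% of instructions
STALL_PER_LOAD = 1     # one structural-hazard stall cycle per load

speedup_a = speedup(DEPTH, 0.0, 1.0)                              # dual-ported
speedup_b = speedup(DEPTH, LOAD_FRACTION * STALL_PER_LOAD, 1.05)  # single-ported, faster clock
ratio = speedup_a / speedup_b

print(speedup_a, round(speedup_b / DEPTH, 2), round(ratio, 2))
```

The 5% clock advantage of the single-ported machine does not make up for stalling on 40% of its instructions.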
Data Dependencies
Problem:
    i1: mul r1, r2, r3;
    i2: add r2, r3, r5;
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) in which instructions are delayed by bubbles.]
What can we (S/W) do?
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd
Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd
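The benefit of the reordering can be counted with a small script. It is a sketch under a simplifying assumption that is not stated on the slide: results are forwarded, so the only stall is one cycle when an instruction uses a register loaded by the instruction immediately before it. The tuple encoding (opcode, destination register, source registers) is hypothetical.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the preceding LW."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(load_use_stalls(slow), load_use_stalls(fast))
```

Under that assumption the slow sequence stalls twice (ADD after LW Rc, SUB after LW Rf), while the fast sequence places an independent instruction after each of those loads and does not stall.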
Control Hazards: Branches
Instruction flow
Stream of instructions processed by Inst. Fetch unit
Speed of "input flow" puts bound on rate of outputs generated
[Figure: pipeline stages Ifetch, Reg, ALU, DMem, Reg.]
MIPS Solution:
Move Zero test to ID/RF stage
Adder to calculate new PC in ID/RF stage
1 clock cycle penalty for branch versus 3
Pipeline MIPS (DLX) Datapath
Instruction Fetch
Instr. Decode / Reg. Fetch
Execute / Addr. Calc.
Memory Access
Write Back
This is the correct 1 cycle latency implementation!
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear - flushing pipe
#2: Predict Branch Not Taken
Execute successor instructions in sequence
"Squash" instructions in pipeline if branch actually taken
Advantage of late pipeline state update
47% DLX branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
53% DLX branches taken on average
But haven't calculated branch target address in DLX
  - DLX still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome
#4: Delayed Branch
Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n      (branch delay of length n)
    branch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
DLX uses this
Delayed Branch
Where to get instructions to fill branch delay slot?
Before branch instruction
From the target address: only valuable when branch taken
From fall through: only valuable when branch not taken
Cancelling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots useful in computation
About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
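The compiler-effectiveness figures multiply out as follows. This is a small sketch; the interpretation of the remainder as wasted cycles per branch assumes a single delay slot, and the variable names are illustrative.

```python
fill_rate = 0.60    # fraction of branch delay slots the compiler fills
useful_rate = 0.80  # fraction of filled slots that do useful work

usefully_filled = fill_rate * useful_rate        # quoted as "about 50%"
wasted_cycles_per_branch = 1 - usefully_filled   # assuming one delay slot

print(f"{usefully_filled:.2f} {wasted_cycles_per_branch:.2f}")
```

The product is 0.48, which the slide rounds to about 50%, leaving roughly half a cycle lost per branch.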
Evaluating Branch Alternatives
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Conditional and unconditional branches = 14% of instructions; 65% of them change the PC
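The CPI column of the table can be recomputed from the speedup formula. The sketch below makes two assumptions that the slide does not state explicitly: a pipeline depth of 5, and that the predict-not-taken penalty is paid only by the 65% of branches that change the PC. Names are illustrative.

```python
BRANCH_FREQ = 0.14      # conditional + unconditional branches
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC
PIPELINE_DEPTH = 5      # assumed 5-stage pipeline


def cpi(effective_penalty: float) -> float:
    """CPI = 1 + branch frequency x average penalty per branch."""
    return 1.0 + BRANCH_FREQ * effective_penalty


# Average penalty cycles per branch for each scheme.
schemes = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # assumption: only taken branches pay
    "Delayed branch": 0.5,
}

stall_cpi = cpi(schemes["Stall pipeline"])
for name, penalty in schemes.items():
    c = cpi(penalty)
    print(f"{name:18s} CPI={c:.2f}  vs unpipelined={PIPELINE_DEPTH / c:.2f}  "
          f"vs stall={stall_cpi / c:.2f}")
```

The recomputed CPI values match the table; the speedup columns can differ from the table in the last digit because the table's figures appear to be rounded at intermediate steps.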
Branch Prediction based on history
Can we use history of branch behaviour to predict branch outcome?
Simplest scheme: use 1 bit of "history"
Set bit to Predict Taken (T) or Predict Not-taken (NT)
Pipeline checks bit value and predicts
  - If incorrect then need to invalidate instruction
Actual outcome used to set the bit value
Example: let initial value = T, actual outcome of branches is NT, T, T, NT, T, T
Predictions are: T, NT, T, T, NT, T
  - 4 wrong (predictions 1, 2, 4, and 5), 2 correct = 33% accuracy
In general, can have k-bit predictors: more when we cover superscalar processors.
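The 1-bit scheme is easy to simulate: predict whatever the bit says, then overwrite the bit with the actual outcome. This is a minimal sketch with hypothetical names, run on the outcome sequence from the example.

```python
def one_bit_predictor(outcomes, initial="T"):
    """Return the prediction made before each branch outcome is known."""
    state = initial
    predictions = []
    for actual in outcomes:
        predictions.append(state)  # predict from the history bit
        state = actual             # the actual outcome becomes the new bit
    return predictions


outcomes = ["NT", "T", "T", "NT", "T", "T"]
predictions = one_bit_predictor(outcomes, initial="T")
wrong = sum(p != a for p, a in zip(predictions, outcomes))
accuracy = (len(outcomes) - wrong) / len(outcomes)

print(predictions)
print(wrong, round(accuracy, 2))
```

Tallying the simulated predictions against the outcomes gives 4 mispredictions out of 6. Each isolated not-taken branch costs two mispredictions, one when it happens and one on the following taken branch.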
Summary: Control and Pipelining
What is ILP?
Processor and Compiler design techniques that speed up execution by causing individual machine operations to execute in parallel