
CS 211: Computer Architecture: Instructor: Prof. Bhagi Narahari

This document provides an overview of CS211: Computer Architecture taught by Prof. Bhagi Narahari. It discusses how to improve computer performance through pipelining. Pipelining involves dividing the instruction execution process into stages to overlap computations. This decreases the clock cycle and improves performance by reducing the cycles per instruction count. The document then discusses how to pipeline the 5-stage RISC load and 4-stage RISC ALU instruction pipelines, noting the structural hazard of writing to the register file in the same cycle for both instruction types. It suggests inserting a bubble or stall cycle to resolve the hazard.

CS211 1

CS 211: Computer Architecture


Instructor: Prof. Bhagi Narahari
Dept. of Computer Science
Course URL:
www.seas.gwu.edu/~narahari/cs211/
CS211 2
How to improve performance?

Recall that performance is a function of three factors:
- CPI: cycles per instruction
- Clock cycle time
- Instruction count
Reducing any of the three factors leads to improved performance.
CS211 3
How to improve performance?

First step is to apply the concept of pipelining to the instruction execution process
- Overlap computations
What does this do?
- Decreases the clock cycle
- Decreases effective CPU time compared to the original clock cycle
Reading: Appendix A of the textbook; also parts of Chapter 2
CS211 4
Pipeline Approach to Improve System
Performance

Analogous to fluid flow in pipelines and assembly lines in factories
- Divide the process into "stages" and send tasks into a pipeline
- Overlap computations of different tasks by operating on them concurrently in different stages
CS211 5
Instruction Pipeline

The instruction execution process lends itself naturally to pipelining
- Overlap the subtasks of instruction fetch, decode, and execute
CS211 6
Linear Pipeline
Processor
A linear pipeline processes a sequence of subtasks with linear precedence
- At a higher level: a sequence of processors
- Data flows in streams from stage S1 to the final stage Sk
- Control of data flow: synchronous or asynchronous
[Figure: linear pipeline of stages S1 -> S2 -> ... -> Sk]
CS211 7
Synchronous Pipeline
- All transfers are simultaneous
- One task or operation enters the pipeline per cycle
- Each processor's reservation table is a diagonal
CS211 8
Time Space Utilization of Pipeline
[Figure: time-space diagram, with time measured in pipeline cycles; tasks T1, T2, T3, ... advance diagonally through stages S1-S3, one stage per cycle]
Full pipeline after 4 cycles
CS211 9
Asynchronous Pipeline

- Transfers are performed when individual processors are ready
- Handshaking protocol between processors
- Mainly used in multiprocessor systems with message-passing
CS211 10
Pipeline Clock and Timing
[Figure: stages S_i and S_i+1 separated by latches; stage delay tau_m, latch delay d]
- Clock cycle of the pipeline: tau = max{tau_m} + d
- Latch delay: d
- Pipeline frequency: f = 1/tau
CS211 11
Speedup and Efficiency
A k-stage pipeline processes n tasks in k + (n-1) clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks.

Total time to process n tasks:
    T_k = [k + (n-1)] * tau

For the non-pipelined processor:
    T_1 = n * k * tau

Speedup factor:
    S_k = T_1 / T_k = (n * k * tau) / ([k + (n-1)] * tau) = n * k / (k + (n-1))
CS211 12
Efficiency and Throughput
Efficiency of the k-stage pipeline:
    E_k = S_k / k = n / (k + (n-1))

Pipeline throughput (the number of tasks per unit time; note the equivalence to IPC):
    H_k = n / ([k + (n-1)] * tau) = n * f / (k + (n-1))
CS211 13
Pipeline Performance: Example
- A task has 4 subtasks with times t1=60, t2=50, t3=90, and t4=80 ns (nanoseconds); latch delay = 10 ns
- Pipeline cycle time = 90 + 10 = 100 ns
- For non-pipelined execution: time = 60 + 50 + 90 + 80 = 280 ns
- Speedup for the above case is 280/100 = 2.8 !!
- Pipeline time for 1000 tasks = (1000 + 4 - 1) cycles = 1003 x 100 ns
- Sequential time = 1000 x 280 ns
- Throughput = 1000 / (1003 x 100 ns)
- What is the problem here? How to improve performance?
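The numbers above follow directly from the formulas on the preceding slides. A quick sketch (in Python, using the slide's values) recomputes them:

```python
# Sketch: pipeline performance for the slide's example (k = 4 stages),
# using T_k = [k + (n-1)] * tau and S_k = T_1 / T_k.
stage_times = [60, 50, 90, 80]    # subtask times in ns
latch_delay = 10                  # ns
n_tasks = 1000

k = len(stage_times)
cycle = max(stage_times) + latch_delay   # pipeline clock: 90 + 10 = 100 ns
t_seq = sum(stage_times)                 # one task, non-pipelined: 280 ns

t_pipe = (k + n_tasks - 1) * cycle       # 1003 cycles at 100 ns each
t_nonpipe = n_tasks * t_seq              # 1000 * 280 ns

speedup = t_nonpipe / t_pipe
efficiency = speedup / k
throughput = n_tasks / t_pipe            # tasks per ns

print(f"cycle = {cycle} ns, speedup = {speedup:.2f}, "
      f"efficiency = {efficiency:.2f}, throughput = {throughput * 1e3:.2f} tasks/us")
```

For n = 1000 the speedup is slightly below the single-task figure of 2.8, since the k-1 fill cycles are amortized but never free.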
CS211 14
Non-linear Pipelines and Pipeline Control Algorithms
- Can have non-linear paths in the pipeline
- How to schedule instructions so they do not conflict for resources?
- How does one control the pipeline at the microarchitecture level?
  - How to build a scheduler in hardware?
  - How much time does the scheduler have to make a decision?
CS211 15
Non-linear Dynamic Pipelines
- Multiple processors (k stages), as in a linear pipeline
- Variable functions of individual processors
- Functions may be dynamically assigned
- Feedforward and feedback connections
CS211 16
Reservation Tables
- A reservation table displays the time-space flow of data through the pipeline (analogous to the opcode of the pipeline)
- Not diagonal, as in linear pipelines
- Multiple reservation tables for different functions
- Functions may be dynamically assigned
- Feedforward and feedback connections
- The number of columns in the reservation table gives the evaluation time of a given function
CS211 17
Reservation Tables (Examples)
[Figure: example reservation tables for two different functions]
CS211 18
Latency Analysis

- Latency: the number of clock cycles between two initiations of the pipeline
- Collision: an attempt by two initiations to use the same pipeline stage at the same time
- Some latencies cause collisions, some do not

CS211 19
Collisions (Example)
[Figure: reservation table with initiations X1, X2, X3, X4 issued over cycles 1-10; with latency = 2, two initiations attempt to use the same stage in the same cycle]
Latency = 2
CS211 20
Latency Cycle
[Figure: initiations X1, X2, X3, X4 issued over cycles 1-18, repeating periodically without collisions]
- Latency cycle: a sequence of initiations that has a repetitive subsequence and no collisions
- Latency sequence length: the number of time intervals within the cycle
- Average latency: the sum of all latencies divided by the number of latencies along the cycle
CS211 21
Collision-Free Scheduling
- Goal: find the shortest average latency
- Latencies: for a reservation table with n columns, the maximum forbidden latency is m <= n - 1, and a permissible latency p satisfies 1 <= p <= m - 1
- Ideal case: p = 1 (static pipeline)
- Collision vector: C = (C_m C_m-1 ... C_2 C_1)
  - C_i = 1 if latency i causes a collision
  - C_i = 0 for permissible latencies
CS211 22
Collision Vector
[Figure: reservation tables for two functions X and Y; the marked cells determine each function's forbidden latencies, and hence its collision vector C = (C_m ... C_1)]
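The definition above can be checked mechanically: latency i is forbidden exactly when some stage's row has two marks i columns apart. This sketch derives the forbidden latencies and the collision vector for a small made-up reservation table (not the one in the slide's figure):

```python
# Sketch: derive forbidden latencies and the collision vector from a
# reservation table (rows = stages, columns = cycles). The table here
# is an illustrative example, not the one in the slide's figure.
table = [
    [1, 0, 0, 0, 0, 1],   # stage S1 used in cycles 0 and 5
    [0, 1, 0, 1, 0, 0],   # stage S2 used in cycles 1 and 3
    [0, 0, 1, 0, 1, 0],   # stage S3 used in cycles 2 and 4
]

def forbidden_latencies(table):
    """Latency i is forbidden if some stage is used in cycles t and t + i."""
    forb = set()
    for row in table:
        used = [t for t, bit in enumerate(row) if bit]
        for a in used:
            for b in used:
                if b > a:
                    forb.add(b - a)
    return forb

forb = forbidden_latencies(table)
m = max(forb)                                          # max forbidden latency, m <= n - 1
C = [1 if i in forb else 0 for i in range(m, 0, -1)]   # C = (C_m ... C_1)
print("forbidden latencies:", sorted(forb), " C =", C)
```

Any latency whose bit is 0 can be used as an initiation interval without a collision; a scheduler then searches these permissible latencies for the cycle with the shortest average latency.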
CS211 23
Back to Our Focus: Computer Pipelines
- We execute billions of instructions, so throughput is what matters
- MIPS desirable features:
  - all instructions are the same length
  - registers are located in the same place in the instruction format
  - memory operands appear only in loads and stores
CS211 24
Designing a Pipelined Processor
- Go back and examine your datapath and control diagram
- Associate resources with states
- Ensure that flows do not conflict, or figure out how to resolve conflicts
- Assert control in the appropriate stage
CS211 25
5 Steps of MIPS Datapath
[Figure: datapath drawn across the five steps (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with Next PC logic, an adder for Next SEQ PC, the register file (RS1, RS2, RD, W Data), a sign-extended immediate, muxes, the ALU with a Zero? output, and the data memory]
What do we need to do to pipeline the process?
CS211 26
5 Steps of MIPS/DLX Datapath
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the stages; the destination register RD travels down the pipe with its instruction]
- Data stationary control: local decode for each instruction phase / pipeline stage
CS211 27
Graphically Representing Pipelines
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
CS211 28
Visualizing Pipelining
[Figure: time (clock cycles 1-7) runs horizontally and instruction order vertically; each instruction passes through Ifetch, Reg, ALU, DMem, Reg in successive cycles, offset one cycle per instruction]
CS211 29
Conventional Pipeline Execution Representation

    Fetch Dcd Exec Mem WB
          Fetch Dcd Exec Mem WB
                Fetch Dcd Exec Mem WB

(one row per instruction; program flow runs down, time runs to the right)
CS211 30
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing comparison over cycles 1-10.
Single-cycle implementation: one long clock cycle per instruction (Load, then Store), with wasted slack for the shorter instructions.
Multiple-cycle implementation: Load takes Ifetch, Reg, Exec, Mem, Wr in five short cycles, followed by Store.
Pipeline implementation: Load, Store, and R-type overlap, with one instruction starting per cycle.]
CS211 31
The Five Stages of Load
- Ifetch: Instruction Fetch (fetch the instruction from the instruction memory)
- Reg/Dec: register fetch and instruction decode
- Exec: calculate the memory address
- Mem: read the data from the data memory
- Wr: write the data back to the register file

Load: Ifetch | Reg/Dec | Exec | Mem | Wr   (cycles 1-5)
CS211 32
The Four Stages of R-type
- Ifetch: Instruction Fetch (fetch the instruction from the instruction memory)
- Reg/Dec: register fetch and instruction decode
- Exec: the ALU operates on the two register operands; update PC
- Wr: write the ALU output back to the register file

R-type: Ifetch | Reg/Dec | Exec | Wr   (cycles 1-4)
CS211 33
Pipelining the R-type and Load Instructions
- We have a pipeline conflict, or structural hazard:
  - two instructions try to write to the register file at the same time!
  - only one write port
[Figure: R-type, R-type, Load, R-type, R-type issued on successive cycles; the Load's Wr and the following R-type's Wr land in the same cycle. Oops! We have a problem.]
CS211 34
Important Observation
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions:
  - Load uses the register file's write port during its 5th stage
  - R-type uses the register file's write port during its 4th stage

    Load:   Ifetch | Reg/Dec | Exec | Mem | Wr   (1 2 3 4 5)
    R-type: Ifetch | Reg/Dec | Exec | Wr         (1 2 3 4)

There are 2 ways to solve this pipeline hazard.
CS211 35
Solution 1: Insert a "Bubble" into the Pipeline
- Insert a "bubble" into the pipeline to prevent 2 writes in the same cycle
  - the control logic can be complex
  - we lose an instruction fetch and issue opportunity
- No instruction is started in Cycle 6!
[Figure: the pipeline bubble delays the instruction behind the Load by one cycle, so only one write occurs per cycle]
CS211 36
Solution 2: Delay R-type's Write by One Cycle
- Delay the R-type's register write by one cycle:
  - now R-type instructions also use the register file's write port at stage 5
  - the Mem stage is a NOOP stage: nothing is done there

    R-type: Ifetch | Reg/Dec | Exec | Mem (NOOP) | Wr   (1 2 3 4 5)

[Figure: with all instructions writing in stage 5, the write-port conflict disappears]
CS211 37
Why Pipeline?
- Suppose we execute 100 instructions
- Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle machine: 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
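A minimal sketch reproducing the three totals from the slide's assumptions:

```python
# Sketch: total execution time for 100 instructions under the three
# designs, with the slide's cycle times and CPIs.
n = 100

single_cycle = 45 * 1 * n          # 45 ns/cycle x 1 CPI x 100 inst
multicycle   = 10 * 4.6 * n        # 10 ns/cycle x 4.6 CPI x 100 inst
pipelined    = 10 * (1 * n + 4)    # 10 ns/cycle x (100 cycles + 4-cycle drain)

print(single_cycle, multicycle, pipelined)
```

The pipelined machine wins because it keeps the short multicycle clock while bringing the effective CPI back to nearly 1.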
CS211 38
Why Pipeline? Because the Resources Are There!
[Figure: Inst 0 through Inst 4 flow through Im (instruction memory), Reg, ALU, Dm (data memory), and Reg; once the pipe is full, every resource is busy in every cycle]
CS211 39
Problems with Pipelined Processors?
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle and introduce stall cycles, which increase CPI
  - Structural hazards: the HW cannot support this combination of instructions (two dogs fighting for the same bone)
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (data dependencies)
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow, i.e. branches and jumps (control dependencies)
- Can always resolve hazards by stalling
- More stall cycles = more CPU time = less performance
- Increasing performance means decreasing stall cycles
CS211 40
Back to Our Old Friend: the CPU Time Equation
- Recall the equation for CPU time
- So what are we doing by pipelining the instruction execution process?
  - Clock?
  - Instruction count?
  - CPI?
  - How is CPI affected by the various hazards?
CS211 41
Speedup Equation for Pipelining

    Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)
              x (Cycle Time unpipelined / Cycle Time pipelined)

    CPI pipelined = Ideal CPI + Average stall cycles per instruction

For the simple RISC pipeline, Ideal CPI = 1:

    Speedup = Pipeline depth / (1 + Pipeline stall CPI)
              x (Cycle Time unpipelined / Cycle Time pipelined)
CS211 42
One Memory Port: Structural Hazards
[Figure: Load followed by Instr 1-4 over cycles 1-7; in cycle 4 the Load's DMem access and Instr 3's Ifetch both need the single memory port]
CS211 43
One Memory Port: Structural Hazards (continued)
[Figure: the conflict is resolved by stalling; Instr 3's fetch is delayed one cycle, inserting a bubble that propagates down the pipeline]
CS211 44
Example: Dual-port vs. Single-port
- Machine A: dual-ported memory ("Harvard architecture")
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Note: loads will cause stalls of 1 cycle
- Recall our friend:
  - CPU time = IC x CPI x Clk
  - CPI = ideal CPI + stalls
CS211 45
Example (continued)
- Machine A: dual-ported memory ("Harvard architecture")
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both; loads are 40% of instructions executed

    SpeedupA = Pipe. Depth / (1 + 0) x (clock_unpipe / clock_pipe)
             = Pipeline Depth
    SpeedupB = Pipe. Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
             = (Pipe. Depth / 1.4) x 1.05
             = 0.75 x Pipe. Depth
    SpeedupA / SpeedupB = Pipe. Depth / (0.75 x Pipe. Depth) = 1.33

Machine A is 1.33 times faster
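The algebra above is easy to check numerically, and the pipeline depth cancels out of the ratio:

```python
# Sketch: Machine A (dual-ported memory) vs. Machine B (single-ported,
# 1.05x faster clock, 1-cycle stall on the 40% of instructions that are loads).
depth = 5                  # any pipeline depth works; it cancels out of the ratio

speedup_a = depth / (1 + 0)                    # no structural stalls
speedup_b = depth / (1 + 0.4 * 1) * 1.05       # stalls, but a faster clock

ratio = speedup_a / speedup_b
print(f"Machine A is {ratio:.2f}x faster")
```

The 5% clock advantage is nowhere near enough to pay for a stall on 40% of the instruction stream.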
CS211 46
Data Dependencies
- True dependencies and false dependencies:
  - false implies we can remove the dependency
  - true implies we are stuck with it!
- Three types of data dependencies, defined in terms of how a succeeding instruction depends on a preceding instruction:
  - RAW: Read After Write, or flow dependency
  - WAR: Write After Read, or anti-dependency
  - WAW: Write After Write, or output dependency
CS211 47

Three Generic Data Hazards
- Read After Write (RAW): Instr J tries to read an operand before Instr I writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

- Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
CS211 48
RAW Dependency
- Example program (a) with two instructions:

    i1: load r1, a;
    i2: add r2, r1, r1;

- Program (b) with two instructions:

    i1: mul r1, r4, r5;
    i2: add r2, r1, r1;

- In both cases we cannot read in i2 until i1 has completed writing the result:
  - in (a) this is due to a load-use dependency
  - in (b) this is due to a define-use dependency
CS211 49

Three Generic Data Hazards
- Write After Read (WAR): Instr J writes an operand before Instr I reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in the MIPS 5-stage pipeline because:
  - all instructions take 5 stages, and
  - reads are always in stage 2, and
  - writes are always in stage 5
CS211 50
Three Generic Data Hazards
- Write After Write (WAW): Instr J writes an operand before Instr I writes it

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
- Can't happen in the MIPS 5-stage pipeline because:
  - all instructions take 5 stages, and
  - writes are always in stage 5
- We will see WAR and WAW in later, more complicated pipes
CS211 51
WAR and WAW Dependency
- Example program (a):

    i1: mul r1, r2, r3;
    i2: add r2, r4, r5;

- Example program (b):

    i1: mul r1, r2, r3;
    i2: add r1, r4, r5;

- In both cases we have a dependence between i1 and i2:
  - in (a) because r2 must be read by i1 before it is written by i2 (WAR)
  - in (b) because r1 must be written by i2 after it has been written by i1 (WAW)
CS211 52
What to Do with WAR and WAW?
- Problem:

    i1: mul r1, r2, r3;
    i2: add r2, r4, r5;

- Is this really a dependence/hazard?
CS211 53
What to Do with WAR and WAW
- Solution: rename registers

    i1: mul r1, r2, r3;
    i2: add r6, r4, r5;

- Register renaming can remove many of these false dependencies
  - note the role that the compiler plays in this
  - specifically, the register allocation process, i.e., the process that assigns registers to variables
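As an illustration of the idea (not the register-allocation algorithm a real compiler uses), a minimal renaming pass can give each conflicting definition a fresh register; the helper name `rename` and the choice of r6 as the first free register are assumptions for this sketch:

```python
# Sketch: remove WAR/WAW (false) dependences by giving each conflicting
# definition a fresh register. Illustrative only; real compilers do this
# as part of register allocation. Instructions are (op, dst, src1, src2).
def rename(instrs, first_free=6):
    mapping = {}               # architectural register -> current name
    out, free = [], first_free
    for op, dst, src1, src2 in instrs:
        s1 = mapping.get(src1, src1)           # reads see the latest name
        s2 = mapping.get(src2, src2)
        # dst conflicts if an earlier instruction already used that name
        if any(dst in (d, a, b) for _, d, a, b in out):
            mapping[dst] = f"r{free}"          # give it a fresh register
            free += 1
        out.append((op, mapping.get(dst, dst), s1, s2))
    return out

prog = [("mul", "r1", "r2", "r3"),     # i1
        ("add", "r2", "r4", "r5")]     # i2: WAR on r2 with i1
print(rename(prog))
```

On the slide's example this rewrites i2's destination to r6, which is exactly the renamed program shown above.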
CS211 54
Hazard Detection in HW
- Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline
- How to detect and store potential hazard information?
  - note that hazards in machine code are based on register usage
  - keep track of results in registers and their usage (constructing a register data-flow graph)
- For each instruction i, construct the set of read registers and write registers:
  - Rregs(i) is the set of registers that instruction i reads from
  - Wregs(i) is the set of registers that instruction i writes to
- Use these to define the 3 types of data hazards
CS211 55
Hazard Detection in Hardware
- A RAW hazard exists on a register if Rregs(i) ∩ Wregs(j) ≠ ∅
  - keep a record of pending writes (for instructions in the pipe) and compare with the operand registers of the current instruction
  - when an instruction issues, reserve its result register
  - when an operation completes, remove its write reservation
- A WAW hazard exists on a register if Wregs(i) ∩ Wregs(j) ≠ ∅
- A WAR hazard exists on a register if Wregs(i) ∩ Rregs(j) ≠ ∅
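The three set tests above translate directly into code. A sketch (the function name and the set encoding are illustrative):

```python
# Sketch: classify the data hazards between an issuing instruction i and
# a predecessor j still in the pipe, using the read/write sets above.
def hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    h = []
    if rregs_i & wregs_j:
        h.append("RAW")    # i reads a register that j writes
    if wregs_i & wregs_j:
        h.append("WAW")    # i writes a register that j writes
    if wregs_i & rregs_j:
        h.append("WAR")    # i writes a register that j reads
    return h

# j: add r1,r2,r3   then   i: sub r4,r1,r3  -> RAW on r1
print(hazards(rregs_i={"r1", "r3"}, wregs_i={"r4"},
              rregs_j={"r2", "r3"}, wregs_j={"r1"}))
```

Hardware implements the same idea with a scoreboard of pending register writes rather than explicit sets.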
CS211 56
Internal Forwarding: Getting Rid of Some Hazards
- In some cases the data needed by the next instruction at the ALU stage has been computed by the ALU (or some stage defining it) but has not yet been written back to the registers
- Can we "forward" this result by bypassing stages?
CS211 57
Data Hazard on r1

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: time (clock cycles) with stages IF, ID/RF, EX, MEM, WB; the sub, and, and or each read r1 before the add has written it back in WB]
CS211 58
Forwarding to Avoid Data Hazard

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: same code; forwarding paths carry the add's ALU result directly to the ALU inputs of the following instructions, avoiding the stalls]
CS211 59
Internal Forwarding of Instructions
- Forward the result from the ALU/Execute unit to the execute unit in the next stage
- Can also be used in cases of memory access:
  - in some cases, an operand fetched from memory has been computed previously by the program
  - can we "forward" this result to a later stage, thus avoiding an extra read from memory?
  - who does this?
- Internal forwarding cases:
  - stage i to stage i+k in the pipeline
  - store-load forwarding
  - load-load forwarding
  - store-store forwarding
CS211 60
Internal Data Forwarding
- Store-load forwarding: a load following a store to the same location M can be replaced by a register move

    STO M,R1 ; LD R2,M   =>   STO M,R1 ; MOVE R2,R1

[Figure: the load's memory access is replaced by a register-to-register move]
CS211 61
Internal Data Forwarding
- Load-load forwarding: the second load from the same location M can be replaced by a register move

    LD R1,M ; LD R2,M   =>   LD R1,M ; MOVE R2,R1

[Figure: the second load's memory access is replaced by a register-to-register move]
CS211 62
Internal Data Forwarding
- Store-store forwarding: the first of two stores to the same location M can be eliminated, since it is immediately overwritten

    STO M,R1 ; STO M,R2   =>   STO M,R2

[Figure: only the second store reaches memory]
CS211 63
HW Change for Forwarding
[Figure: the ID/EX, EX/MEM, and MEM/WB pipeline registers feed multiplexers at the ALU inputs, so a result can be bypassed from a later stage back to the ALU before it reaches the register file; the immediate and NextPC paths are unchanged]
CS211 64
What About Memory Operations?
- If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
- What does delaying WB on arithmetic operations cost?
  - cycles? hardware?
- What about data dependence on loads?

    R1 <- R4 + R5
    R2 <- Mem[R2 + I]
    R3 <- R2 + R1

  "Delayed loads": can recognize this in the decode stage and introduce a bubble while stalling the fetch stage
- Tricky situation:

    R1 <- Mem[R2 + I]
    Mem[R3 + 34] <- R1

  Handle with a bypass in the memory stage!
CS211 65
Data Hazard Even with Forwarding

    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

[Figure: the lw produces r1 only at the end of its MEM stage, but the sub needs it at the start of its EX stage in the same cycle, so forwarding alone cannot eliminate the hazard]
CS211 66
Data Hazard Even with Forwarding (continued)

    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

[Figure: a one-cycle bubble is inserted so the lw's MEM output can be forwarded to the sub's EX stage; the and and or are delayed accordingly]
CS211 67
What can we (the S/W) do?
CS211 68
Software Scheduling to Avoid Load Hazards
Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
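The payoff of the reordering can be counted. Assuming a single load-delay slot, i.e. a one-cycle stall whenever an instruction reads the destination of the immediately preceding LW, a sketch:

```python
# Sketch: count load-use stalls, assuming a 1-cycle stall whenever an
# instruction reads the destination of the immediately preceding LW.
# Instructions are (op, dst, srcs).
def load_use_stalls(code):
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", "a", ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", "d", ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", "a", ["Ra"]),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", "d", ["Rd"])]
print(load_use_stalls(slow), load_use_stalls(fast))
```

The slow schedule pays two stalls (after LW Rc and LW Rf); the fast schedule pays none, at no cost in instruction count.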
CS211 69
Control Hazards: Branches
- Instruction flow:
  - a stream of instructions is processed by the instruction fetch unit
  - the speed of the "input flow" puts a bound on the rate of outputs generated
- A branch instruction affects the instruction flow:
  - we do not know the next instruction to be executed until the branch outcome is known
- When we hit a branch instruction:
  - need to compute the target address (where to branch)
  - need resolution of the branch condition (true or false)
  - might need to 'flush' the pipeline if other instructions have been fetched for execution
CS211 70
Control Hazard on Branches: Three-Stage Stall

    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11

[Figure: the three instructions after the beq enter the pipeline before the branch resolves, so they must be squashed if the branch is taken, costing three cycles]
CS211 71
Branch Stall Impact
- If CPI = 1 and 30% of instructions are branches, a 3-cycle stall => new CPI = 1.9!
- Two-part solution:
  - determine whether the branch is taken or not sooner, AND
  - compute the taken-branch address earlier
- A MIPS branch tests if a register = 0 or != 0
- MIPS solution:
  - move the zero test to the ID/RF stage
  - add an adder to calculate the new PC in the ID/RF stage
  - 1 clock cycle penalty for a branch, versus 3
CS211 72
Pipelined MIPS (DLX) Datapath
[Figure: datapath with the branch zero test and target adder moved into the Instr. Decode / Reg. Fetch stage; Memory Access and Write Back are unchanged. This is the correct 1-cycle-latency implementation!]
CS211 73
Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear (flushing the pipe)
#2: Predict branch not taken
  - execute successor instructions in sequence
  - "squash" instructions in the pipeline if the branch is actually taken
  - advantage of late pipeline state update
  - 47% of DLX branches are not taken on average
  - PC+4 is already calculated, so use it to get the next instruction
#3: Predict branch taken
  - 53% of DLX branches are taken on average
  - but we haven't calculated the branch target address in DLX
    - DLX still incurs a 1-cycle branch penalty
    - other machines: branch target known before outcome
CS211 74
Four Branch Hazard Alternatives
#4: Delayed branch
  - define the branch to take place AFTER a following instruction:

        branch instruction
        sequential successor_1
        sequential successor_2
        ........
        sequential successor_n
        branch target if taken

  - a 1-slot delay allows a proper decision and the branch target address in a 5-stage pipeline
  - DLX uses this
  - branch delay of length n
CS211 75
CS211 76
Delayed Branch
- Where do we get instructions to fill the branch delay slot?
  - from before the branch instruction
  - from the target address: only valuable when the branch is taken
  - from fall-through: only valuable when the branch is not taken
  - cancelling branches allow more slots to be filled
- Compiler effectiveness for a single branch delay slot:
  - fills about 60% of branch delay slots
  - about 80% of instructions executed in branch delay slots are useful in computation
  - about 50% (60% x 80%) of slots are usefully filled
- Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
CS211 77
Evaluating Branch Alternatives

    Scheduling scheme    Branch penalty   CPI    Speedup vs.    Speedup vs.
                                                 unpipelined    stall
    Stall pipeline             3          1.42       3.5          1.0
    Predict taken              1          1.14       4.4          1.26
    Predict not taken          1          1.09       4.5          1.29
    Delayed branch             0.5        1.07       4.6          1.31

Conditional and unconditional branches = 14% of instructions; 65% of them change the PC.

    Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
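Every row of the table comes from CPI = 1 + branch frequency x effective penalty, with the speedups derived from the CPIs. A sketch reproducing it (small differences from the table's last columns come from rounding intermediate results):

```python
# Sketch: reproduce the branch-alternative table, assuming
# CPI = 1 + branch_freq * effective penalty, with branch_freq = 14%.
depth = 5
branch_freq = 0.14

schemes = {                     # effective branch penalty in cycles
    "stall pipeline":    3,
    "predict taken":     1,
    "predict not taken": 0.65 * 1,   # only the 65% that change the PC pay
    "delayed branch":    0.5,
}
cpi = {name: 1 + branch_freq * p for name, p in schemes.items()}
for name in schemes:
    # speedup vs. unpipelined = depth / CPI; vs. stall = CPI_stall / CPI
    print(f"{name:18s} CPI={cpi[name]:.2f} "
          f"vs-unpipelined={depth / cpi[name]:.1f} "
          f"vs-stall={cpi['stall pipeline'] / cpi[name]:.2f}")
```

The "predict not taken" penalty is fractional because the 1-cycle cost is paid only on the 65% of branches that actually change the PC.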
CS211 78
Branch Prediction Based on History
- Can we use the history of branch behaviour to predict the branch outcome?
- Simplest scheme: use 1 bit of "history"
  - set the bit to Predict Taken (T) or Predict Not-taken (NT)
  - the pipeline checks the bit value and predicts; if incorrect, then we need to invalidate the instruction
  - the actual outcome is used to set the bit value
- Example: let the initial value = T, and let the actual outcome of the branches be NT, T, T, NT, T, T
  - the predictions are then: T, NT, T, T, NT, T
  - 4 wrong, 2 correct: about 33% accuracy (a 1-bit predictor mispredicts twice around every change of direction)
- In general, we can have k-bit predictors: more on this when we cover superscalar processors
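The example trace is easy to reproduce; this sketch simulates the 1-bit last-outcome predictor on the slide's sequence:

```python
# Sketch: 1-bit branch predictor, i.e. always predict the last actual outcome.
def one_bit_predictor(outcomes, init="T"):
    bit, preds = init, []
    for actual in outcomes:
        preds.append(bit)      # predict the current bit value...
        bit = actual           # ...then update it with the actual outcome
    return preds

outcomes = ["NT", "T", "T", "NT", "T", "T"]    # the slide's sequence
preds = one_bit_predictor(outcomes, init="T")
wrong = sum(p != a for p, a in zip(preds, outcomes))
print(preds, f"-> {wrong} wrong out of {len(outcomes)}")
```

Each isolated NT costs two mispredictions (the NT itself, plus the T that follows it), which is the behaviour 2-bit saturating counters were designed to dampen.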
CS211 79
Summary: Control and Pipelining
- Just overlap tasks; easy if the tasks are independent
- Speedup <= pipeline depth; if the ideal CPI is 1, then:

    Speedup = Pipeline depth / (1 + Pipeline stall CPI)
              x (Cycle Time unpipelined / Cycle Time pipelined)

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
CS211 80
Summary #1/2: Pipelining
- What makes it easy?
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- What makes it hard? HAZARDS!
  - structural hazards: suppose we had only one memory
  - control hazards: need to worry about branch instructions
  - data hazards: an instruction depends on a previous instruction
- Pipelines pass control information down the pipe just as data moves down the pipe
- Forwarding/stalls are handled by local control
- Exceptions stop the pipeline
CS211 81
Introduction to ILP
- What is ILP?
  - processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
- ILP is transparent to the user
  - multiple operations are executed in parallel even though the system is handed a single program written with a sequential processor in mind
- Same execution hardware as a normal RISC machine
  - there may be more than one of any given type of hardware
CS211 82
Compiler vs. Processor
[Figure: division of labor between compiler and hardware across architectures (Superscalar, Dataflow, Indep. Arch., VLIW, TTA). The frontend and optimizer are always in the compiler, and execution is always in hardware; determining dependences, determining independences, binding operations to function units, and binding transports to busses shift step by step from hardware in a superscalar to the compiler in a TTA.]

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.
