ECE/CS 752: Advanced Computer Architecture I
Revisit Amdahl's Law
Sequential bottleneck
Even if v is infinite
Performance limited by nonvectorizable portion (1 - f)

$$\lim_{v \to \infty} \frac{1}{1 - f + \frac{f}{v}} = \frac{1}{1 - f}$$
[Figure: number of processors (1 to N) versus time, with segments labeled h, 1 - h, 1 - f, and f]
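The sequential bottleneck can be checked numerically. The following is a minimal Python sketch of Amdahl's Law for a vectorizable fraction f and vector speedup v; the function name `amdahl_speedup` and the sample values are illustrative and do not come from the slides.

```python
import math


def amdahl_speedup(f: float, v: float) -> float:
    """Speedup when a fraction f of the work runs v times faster.

    f: vectorizable fraction of the program, in [0, 1]
    v: speedup of the vectorizable part (may be math.inf)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if v <= 0:
        raise ValueError("v must be positive")
    # f / math.inf evaluates to 0.0, so the limit case needs no special code
    # apart from guarding against a zero denominator when f == 1.
    denominator = (1.0 - f) + f / v
    return math.inf if denominator == 0.0 else 1.0 / denominator


if __name__ == "__main__":
    # Even with infinite v, the nonvectorizable portion (1 - f) caps speedup.
    for v in (2, 10, 100, math.inf):
        print(f"f=0.8, v={v}: speedup = {amdahl_speedup(0.8, v):.2f}")
```

With f = 0.8, the speedup approaches 1 / (1 - 0.8) = 5 no matter how large v becomes.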
Pipelined Performance Model
[Figure: pipeline depth (1 to N) versus time, with segments labeled 1 - g and g]
g = fraction of time pipeline is filled
1 - g = fraction of time pipeline is not filled (stalled)
Tyranny of Amdahl's Law [Bob Colwell]
When g is even slightly below 100%, a big performance hit will result
Stalled cycles are the key adversary and must be minimized as much as possible
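To see how sensitive a deep pipeline is to stalls, the sketch below applies an Amdahl-style model to pipelining. It assumes (the formula is not spelled out in the slide text) that the pipeline delivers a speedup of N while filled, for fraction g of the time, and a speedup of 1 while stalled, giving 1 / ((1 - g) + g / N). The name `pipeline_speedup` and the depth of 10 are illustrative.

```python
def pipeline_speedup(g: float, depth: int) -> float:
    """Assumed Amdahl-style pipeline model.

    g:     fraction of time the pipeline is filled, in [0, 1]
    depth: pipeline depth N (speedup while the pipeline is filled)
    """
    if not 0.0 <= g <= 1.0:
        raise ValueError("g must be in [0, 1]")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 1.0 / ((1.0 - g) + g / depth)


if __name__ == "__main__":
    # A small drop in g costs a large share of the ideal speedup.
    for g in (1.0, 0.99, 0.95, 0.90):
        print(f"g={g:.2f}, N=10: speedup = {pipeline_speedup(g, 10):.2f}")
```

Under this assumed model, a 10-stage pipeline that is filled only 90% of the time reaches a speedup of about 5.3 rather than 10.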
Motivation for Superscalar
[Agerwala and Cocke]
[Figure: speedup versus vectorizability f (0 to 1), with curves for n = 4, n = 6, n = 12, n = 100, and n = 6 with s = 2; the typical range of f is marked]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, with s = 2 instead of s = 1 (scalar)
Superscalar Proposal
Moderate tyranny of Amdahl's Law
Ease sequential bottleneck
More generally applicable
Robust (less sensitive to f)
Revised Amdahl's Law:

$$\text{Speedup} = \frac{1}{\frac{1 - f}{s} + \frac{f}{v}}$$
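The Agerwala and Cocke example (speedup rising from 3 to 4.3) can be reproduced with a few lines of Python. This sketch assumes the revised law has the form 1 / ((1 - f) / s + f / v), where s is the speedup applied to the nonvectorizable portion, and that the N = 6 quoted in the example plays the role of v. The function name `revised_amdahl_speedup` is illustrative.

```python
def revised_amdahl_speedup(f: float, v: float, s: float = 1.0) -> float:
    """Revised Amdahl's Law (assumed form).

    f: vectorizable fraction, in [0, 1]
    v: speedup of the vectorizable portion
    s: speedup of the nonvectorizable portion (s = 1 is plain scalar)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if v <= 0 or s <= 0:
        raise ValueError("v and s must be positive")
    return 1.0 / ((1.0 - f) / s + f / v)


if __name__ == "__main__":
    # Slide example: f = 0.8 with N = 6 taken as v.
    print(f"s=1: {revised_amdahl_speedup(0.8, 6, s=1):.1f}")
    print(f"s=2: {revised_amdahl_speedup(0.8, 6, s=2):.1f}")
```

Doubling the speed of the sequential portion raises the overall speedup more than doubling v from 6 to 12 would, which is the argument for easing the sequential bottleneck.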
Limits on Instruction Level Parallelism (ILP)
Study                        ILP limit
Weiss and Smith [1984]       1.58
Sohi and Vajapeyam [1987]    1.81
Tjaden and Flynn [1970]      1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]      1.96
Uht [1986]                   2.00
Smith et al. [1989]          2.00
Jouppi and Wall [1988]       2.40
Johnson [1991]               2.50
Acosta et al. [1986]         2.79
Wedig [1982]                 3.00
Butler et al. [1991]         5.8
Melvin and Patt [1991]       6
Wall [1991]                  7 (Jouppi disagreed)
Kuck et al. [1972]           8
Riseman and Foster [1972]    51 (no control dependences)
Nicolau and Fisher [1984]    90 (Fisher's optimism)
Superscalar Proposal
Go beyond single instruction pipeline, achieve IPC > 1
Dispatch multiple instructions per cycle
Provide more generally applicable form of concurrency (not just vectors)
Geared for sequential code that is hard to parallelize otherwise
Exploit fine-grained or instruction-level parallelism (ILP)
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Baseline scalar RISC
Issue parallelism = IP = 1
Operation latency = OP = 1
Peak IPC = 1
[Figure: pipeline timing diagram for the baseline machine; successive instructions 1 to 6 pass through IF, DE, EX, WB; horizontal axis is time in cycles of the baseline machine]
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Superpipelined: cycle time = 1/m of baseline
Issue parallelism = IP = 1 inst / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = m instr / major cycle (m x speedup?)
[Figure: superpipelined timing diagram; instructions 1 to 6 pass through IF, DE, EX, WB]
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Superscalar:
Issue parallelism = IP = n inst / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instr / cycle (n x speedup?)
[Figure: superscalar timing diagram; instructions 1 to 9 pass through IF, DE, EX, WB]
Classifying ILP Machines
[Jouppi, DECWRL 1991]
VLIW: Very Long Instruction Word
Issue parallelism = IP = n inst / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instr / cycle = 1 VLIW / cycle
[Figure: VLIW timing diagram with stages IF, DE, EX, WB]
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Superpipelined-Superscalar
Issue parallelism = IP = n inst / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = n x m instr / major cycle
[Figure: superpipelined-superscalar timing diagram; instructions 1 to 9 pass through IF, DE, EX, WB]
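The classification reduces to two parameters: issue width n and superpipelining degree m. The Python sketch below tabulates peak IPC per major (baseline) cycle for each class. The class name `IlpMachine`, its field names, and the choice n = m = 3 are illustrative assumptions, not values from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IlpMachine:
    """One point in the (n, m) classification of ILP machines."""
    name: str
    issue_width: int        # n: instructions issued per (minor) cycle
    superpipe_degree: int   # m: minor cycles per baseline (major) cycle

    @property
    def peak_ipc(self) -> int:
        """Peak instructions per major cycle: n x m."""
        return self.issue_width * self.superpipe_degree

    @property
    def op_latency_minor_cycles(self) -> int:
        """Operation latency measured in minor cycles: m."""
        return self.superpipe_degree


# n = 3 and m = 3 are arbitrary illustrative values.
MACHINES = [
    IlpMachine("baseline scalar", 1, 1),
    IlpMachine("superpipelined", 1, 3),
    IlpMachine("superscalar", 3, 1),
    IlpMachine("VLIW", 3, 1),
    IlpMachine("superpipelined-superscalar", 3, 3),
]

if __name__ == "__main__":
    for machine in MACHINES:
        print(f"{machine.name:28s} peak IPC = {machine.peak_ipc}, "
              f"OP = {machine.op_latency_minor_cycles} minor cycle(s)")
```

Superscalar and VLIW share the same peak figures here; the model captures only peak issue rate and latency, not how the parallel instructions are found.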
Superscalar vs. Superpipelined
Roughly equivalent performance
If n = m then both have about the same IPC
Parallelism exposed in space vs. time
[Figure: SUPERSCALAR and SUPERPIPELINED timing diagrams; horizontal axis is time in cycles of the base machine (0 to 13); key: IFetch, Dcode, Execute, Writeback]
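The "roughly equivalent" claim can be illustrated with an idealized timing model. This Python sketch assumes a 4-stage base pipeline, fully independent instructions, and no stalls; the function names and the instruction counts are illustrative, and none of this is taken from the slides.

```python
import math


def superscalar_time(num_instr: int, n: int, stages: int = 4) -> int:
    """Base cycles to finish num_instr instructions, issuing n per base cycle."""
    groups = math.ceil(num_instr / n)
    return stages + groups - 1


def superpipelined_time(num_instr: int, m: int, stages: int = 4) -> float:
    """Base cycles to finish num_instr instructions, issuing one every 1/m cycle.

    Each instruction still spends `stages` base cycles in the pipeline.
    """
    return stages + (num_instr - 1) / m


if __name__ == "__main__":
    for count in (6, 3000):
        scalar_n = superscalar_time(count, n=3)
        piped_m = superpipelined_time(count, m=3)
        print(f"{count} instr: superscalar {scalar_n} cycles, "
              f"superpipelined {piped_m:.2f} cycles")
```

Under these assumptions the superscalar machine finishes slightly earlier because it issues a whole group at once, but the gap stays below one base cycle, so the two converge for long instruction streams.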
Superpipelining: Result Latency
Superpipelining (Jouppi, 1989) essentially describes a pipelined execution stage
[Figure: timing diagrams for Jouppi's base machine, an underpipelined machine, and a superpipelined machine]
Underpipelined machines cannot issue instructions as fast as they are executed
Note: the key characteristic of superpipelined machines is that results are not available to the M - 1 successive instructions
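The result-latency note can be made concrete with a small helper. This Python sketch assumes one instruction issues per minor cycle, an operation latency of m minor cycles, and full forwarding; the function name `stall_minor_cycles` and the notion of "distance" (1 meaning the consumer immediately follows its producer) are illustrative assumptions.

```python
def stall_minor_cycles(m: int, distance: int) -> int:
    """Minor cycles a consumer must wait for its producer's result.

    m:        operation latency in minor cycles (superpipelining degree)
    distance: instruction distance from producer to consumer (1 = next)
    """
    if m < 1 or distance < 1:
        raise ValueError("m and distance must be at least 1")
    # The producer issued at minor cycle t has its result ready at t + m;
    # without stalls the consumer would issue at t + distance.
    return max(0, m - distance)


if __name__ == "__main__":
    m = 3
    for distance in range(1, 5):
        print(f"m={m}, distance={distance}: "
              f"stall {stall_minor_cycles(m, distance)} minor cycle(s)")
```

For m = 3 the two instructions that immediately follow a producer cannot use its result without stalling, matching the M - 1 figure in the note; for m = 1 no consumer stalls.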
Superscalar Challenges
[Figure: superscalar pipeline organization. Stages: FETCH, DECODE, EXECUTE, COMMIT. Components: I-cache, branch predictor, instruction buffer, integer, floating-point, media, and memory units, reorder buffer (ROB), store queue, D-cache. Labeled flows: instruction flow, register data flow, memory data flow.]