Cpe 242 Computer Architecture and Engineering Instruction Level Parallelism


cs 152 ilp.1
DAP & SIK 1995
CpE 242
Computer Architecture and Engineering
Instruction Level Parallelism
Recap: Interconnection Network Implementation Issues
Interconnect                   MPP       LAN                      WAN
Example                        CM-5      Ethernet                 ATM
Maximum length between nodes   25 m      500 m; 5 repeaters       copper: 100 m; optical: 1000 m
Number data lines              4         1                        1
Clock Rate                     40 MHz    10 MHz                   155.5 MHz
Shared vs. Switch              Switch    Shared                   Switch
Maximum number of nodes        2048      254                      > 10,000
Media Material                 Copper    Twisted pair copper      Twisted pair copper
                                         wire or Coaxial cable    wire or optical fiber
Recap: Implementation Issues
Advantages of Serial vs. Parallel lines:
- No synchronizing signals
- Higher clock rate and longer distance than parallel lines (e.g., 60 MHz x 256 bits x 0.5 m vs. 155 MHz x 1 bit x 100 m)
- Imperfections in the copper wires or integrated circuit pad drivers can cause skew in the arrival of signals, limiting the clock rate and the length and number of the parallel lines
Switched vs. Shared Media: with switched media, pairs communicate at the same time over point-to-point connections
Recap: Other Interconnection Network Issues
Interconnect          MPP          LAN           WAN
Example               CM-5         Ethernet      ATM
Topology              Fat tree     Line          Variable, constructed from multistage switches
Connection based?     No           No            Yes
Data Transfer Size    Variable:    Variable:     Fixed:
                      4 to 20B     0 to 1500B    48B
Recap: Network Performance Measures
Overhead: latency of the interface vs. Latency: latency of the network
Recap: Interconnection Network Summary
Communication between computers
Packets for standards, protocols to cover normal and abnormal events
Implementation issues: length, width, media
Performance issues: overhead, latency, bisection BW
Topologies: many to choose from, but (SW) overheads make them look
alike; cost issues in topologies
Outline of Today's Lecture
Recap (5 minutes)
Introduction to Instruction Level Parallelism (15 minutes)
Superpipeline, superscalar, VLIW
Register renaming (5 minutes)
Out-of-order execution (5 minutes)
Branch Prediction (5 minutes)
Limits to ILP (15 minutes)
Summary (5 minutes)
Advanced Pipelining and Instruction Level Parallelism
gcc: 17% of instructions are control transfers => about 5 instructions + 1 branch per basic block
=> must look beyond a single basic block to get more instruction level
parallelism
Loop level parallelism is one opportunity, in SW and HW
What's going on in the loop
Basic Loop (about 9 inst. per 2 FP ops):
load a <- Ai
load y <- Yi
mult m <- a*s
add r <- m+y
store Ai <- r
inc Ai
inc Yi
dec i
branch

Unrolled Loop (about 6 inst. per 2 FP ops; dependencies between instructions remain):
load, load, mult, add, store
load, load, mult, add, store
load, load, mult, add, store
load, load, mult, add, store
inc, inc, dec, branch

Reordered Unrolled Loop:
load, load, load, . . .
mult, mult, mult, mult
add, add, add, add
store, store, store, store
inc, inc, dec, branch

Schedule the 24-instruction basic block relative to the pipeline:
- delay slots
- function unit stalls
- multiple function units
- pipeline depth
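
The instruction counts above can be checked with a short calculation. This is a minimal sketch, assuming the slide's basic loop has 5 work instructions (load, load, mult, add, store) and 4 overhead instructions (inc, inc, dec, branch); the function name is illustrative only.

```python
# Counts read off the slide's basic loop (an assumption of this sketch):
# 5 work instructions and 4 loop-overhead instructions per pass.
WORK_PER_ITERATION = 5
LOOP_OVERHEAD = 4


def instructions_per_iteration(unroll_factor: int) -> float:
    """Average instructions per original iteration after unrolling."""
    if unroll_factor < 1:
        raise ValueError("unroll_factor must be at least 1")
    total = WORK_PER_ITERATION * unroll_factor + LOOP_OVERHEAD
    return total / unroll_factor


# Each original iteration holds 2 FP ops (mult, add), so the value is
# also "instructions per 2 FP ops" as quoted on the slide.
for k in (1, 2, 4, 8):
    print(k, instructions_per_iteration(k))
```

Unrolling 4 times gives the 24-instruction block and the "about 6 inst. per 2 FP ops" figure; the overhead term shrinks as the unroll factor grows, but the work term does not.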
Software Pipelining
Observation: if iterations from loops are independent,
then can get ILP by taking instructions from different
iterations
Software pipelining: reorganizes loops such that each
iteration is made from instructions chosen from different
iterations of the original loop (roughly, Tomasulo in SW)
[Figure: iterations 0 through 4 of the original loop overlapped in time; one software-pipelined iteration cuts across them]
SW Pipelining Example
Before: Unrolled 3 times
LOOP: LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
      LD   F6,-8(R1)
      ADDD F8,F6,F2
      SD   -8(R1),F8
      LD   F10,-16(R1)
      ADDD F12,F10,F2
      SD   -16(R1),F12
      SUBI R1,R1,#24
      BNEZ R1,LOOP
After: Software Pipelined
LOOP: SD   0(R1),F4    ; stores M[i]
      ADDD F4,F0,F2    ; adds to M[i-1]
      LD   F0,-16(R1)  ; loads M[i-2]
      SUBI R1,R1,#8    ; one element (8 bytes) per pass
      BNEZ R1,LOOP
Symbolic Loop Unrolling
- Less code space
- Overhead paid only once vs. each iteration in loop unrolling
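
The idea can be sketched in a high-level language. The example below is only an illustration, not the slide's code: it walks the array upward rather than downward, and the function names and the explicit start-up and wind-down code are assumptions of the sketch. Each steady-state pass stores element i, adds for element i+1, and loads element i+2.

```python
def add_scalar_plain(x, s):
    """Reference loop: load, add, store for each element in turn."""
    out = list(x)
    for i in range(len(out)):
        loaded = out[i]        # LD
        summed = loaded + s    # ADDD
        out[i] = summed        # SD
    return out


def add_scalar_software_pipelined(x, s):
    """Same result, with the three stages taken from three iterations."""
    out = list(x)
    n = len(out)
    if n < 3:
        return add_scalar_plain(out, s)
    # Start-up code: fill the software pipeline.
    summed = out[0] + s        # LD + ADDD for element 0
    loaded = out[1]            # LD for element 1
    # Steady state: one store, one add, one load per pass.
    for i in range(n - 2):
        out[i] = summed        # SD   for element i
        summed = loaded + s    # ADDD for element i+1
        loaded = out[i + 2]    # LD   for element i+2
    # Wind-down code: drain the pipeline.
    out[n - 2] = summed
    out[n - 1] = loaded + s
    return out


print(add_scalar_software_pipelined([1, 2, 3, 4, 5], 10))
```

The start-up and wind-down code runs once per loop, which is the "overhead paid only once" point on the slide.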
How can the machine exploit available ILP?
Technique                                                  Limitation
Pipelining                                                 Issue rate, FU stalls, FU depth
Super-pipeline                                             Clock skew, FU stalls, FU depth
- Issue 1 instr. / (fast) cycle
- IF takes multiple cycles
Super-scalar                                               Hazard resolution
- Issue multiple scalar instructions per cycle
VLIW                                                       Packing
- Each instruction specifies multiple scalar operations
[Figure: pipeline timing diagrams (IF D Ex M W) for each technique]
Case Study: MIPS R4000 (100 MHz to 200 MHz)
8 stage pipeline:
IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
IS: second half of access to instruction cache.
RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
DF: data fetch, first half of access to data cache.
DS: second half of access to data cache.
TC: tag check, determine whether the data cache access hit.
WB: write back for loads and register-register operations.
8 stages: what is the impact on load delay? Branch delay? Why?
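
One way to reason about the question is to count stage distances. The sketch below rests on two assumptions that are not stated on the slide: load data can be forwarded at the end of DS to an instruction entering EX, and the branch outcome is known at the end of EX.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def stage_distance(earlier: str, later: str) -> int:
    """Number of stages separating two pipeline stages."""
    return R4000_STAGES.index(later) - R4000_STAGES.index(earlier)


# Assumption: load data is available at the end of DS and is needed in EX.
load_delay = stage_distance("EX", "DS")

# Assumption: the branch condition is evaluated in EX, and the next
# fetch cannot be redirected until then.
branch_delay = stage_distance("IF", "EX")

print("load delay:", load_delay, "cycles")
print("branch delay:", branch_delay, "cycles")
```

Under these assumptions the deeper pipeline costs 2 cycles on a load use and 3 on a branch, one of which can be covered by a delay slot.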
R4000 Performance
Not ideal CPI of 1:
Load stalls (1 or 2 clock cycles)
Branch stalls (2 cycles + unfilled slots)
FP result stalls: RAW data hazard (latency)
FP structural stalls: Not enough FP hardware (parallelism)
[Figure: bar chart of CPI (axis 0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv; each bar split into Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls]
Issues raised by Superscalar execution
Available parallelism
Resources and available bandwidth
Branch prediction
Hazard detection and (aggressive) resolution
- out-of-order issue => WAR and WAW
- register renaming to avoid false dependencies
- out-of-order completion
Exception handling
[Figure: Instruction Fetch -> Decode -> Instruction Window -> Execution Units. Fetch must look ahead and prefetch instructions; the window issues 0 to N instructions to the execution units according to some policy]
Hardware Schemes for Instruction Parallelism
Why in HW at run time?
- Works when can't know dependence at compile time
- Compiler simpler
- Code for one machine runs well on another
Key idea: Allow instructions behind stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
Enables out-of-order execution => out-of-order completion
The ID stage used to check for both structural and data hazards; out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
Scoreboards allow an instruction to execute whenever 1 & 2
hold, not waiting for prior instructions
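
The key idea can be illustrated with the three instructions above. This is a minimal sketch, assuming DIVD is long-running and its result F0 is still pending; the `Instr` type and `can_start_while_pending` are hypothetical names, and only the RAW check of step 2 is modeled.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


program = [
    Instr("DIVD", "F0", "F2", "F4"),
    Instr("ADDD", "F10", "F0", "F8"),
    Instr("SUBD", "F8", "F8", "F14"),
]


def can_start_while_pending(instr, pending_dests):
    """True if none of instr's sources is still being produced (no RAW)."""
    return instr.src1 not in pending_dests and instr.src2 not in pending_dests


# DIVD has issued but not finished, so F0 is not yet available.
pending = {program[0].dest}
ready = [i.op for i in program[1:] if can_start_while_pending(i, pending)]
print(ready)
```

ADDD must wait for F0, but SUBD has no source pending and can proceed past the stall. SUBD writes F8, which the waiting ADDD still has to read, so letting it run ahead creates a WAR hazard that this sketch does not handle.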
Scoreboard (CDC 6600)
[Figure: scoreboard datapath. Functional units Mem, + and *, numbered (0) to (3), each with a scoreboard entry: op, Ra ?, Rb ?, Rd, S1, S2, where S1 and S2 name the unit producing each pending source value]
Instruction sequence:
r1 <- M[r1 + r2]
r2 <- r2 * r3
r4 <- r2 + r5
r2 <- r0
Issue inst to FU when free and no pending updates in dest.
- hold till registers available (pick them up while waiting)
- OF, Ex, when ready
- update Scoreboard on WB
Scoreboard implications
Out-of-order completion => WAR, WAW hazards?
Solutions for WAR
- queue both the operation and copies of its operands
- read registers only during Read Operands stage
For WAW, must detect hazard: stall until other completes
Need to have multiple instructions in execution phase =>
multiple execution units or pipelined execution units
Scoreboard keeps track of dependencies, state of
operations
Scoreboard replaces ID, EX, WB with 4 stages:
Issue/ID, Read Operands, EX, WB
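
The three hazard types can be made concrete with the four-instruction sequence from the scoreboard slide. This is a minimal sketch; the (destination, sources) encoding and the function name are assumptions made for illustration.

```python
def classify_hazards(first, second):
    """Hazards that a later instruction has on an earlier one.

    Each instruction is (dest, set_of_sources).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    hazards = set()
    if dest1 in srcs2:
        hazards.add("RAW")   # second reads what first writes
    if dest2 in srcs1:
        hazards.add("WAR")   # second overwrites what first reads
    if dest1 == dest2:
        hazards.add("WAW")   # both write the same register
    return hazards


i0 = ("r1", {"r1", "r2"})    # r1 <- M[r1 + r2]
i1 = ("r2", {"r2", "r3"})    # r2 <- r2 * r3
i2 = ("r4", {"r2", "r5"})    # r4 <- r2 + r5
i3 = ("r2", {"r0"})          # r2 <- r0

print(classify_hazards(i1, i2))
print(classify_hazards(i2, i3))
print(classify_hazards(i1, i3))
```

Only the RAW case is a true dependence; the WAR and WAW cases come from reusing the name r2.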
Tomasulo
[Figure: reservation stations in front of the +, * and MEM units. Each station holds an op code, a status, and for each source either a value or a source tag (station or load buffer)]
r1 <- r0 + M[r1 + r2]
r2 <- r2 * r3
r4 <- r2 + r5
r2 <- r0
Distributed resolution
- copy available args when issued
- forward pending ops directly from FUs
Register Renaming
[Figure: Instruction -> Mapping Table (architecturally defined registers) -> Large Internal Register File -> Operand Fetch]
All source registers renamed through the map
On issue:
- Assign new pseudo register for the destination
- Update the map (applies to all following instructions until the next store)
With a large register set, compiler can rename to eliminate WAR
- sometimes requires moves
- HW can do it on the fly (but it can't look at the rest of the program)
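
The on-issue steps can be sketched as follows. This is an illustrative model, not the hardware on the slide: the pseudo-register names `p0`, `p1`, ... and the `rename` function are assumptions, and the program is the four-instruction sequence from the scoreboard slide.

```python
import itertools


def rename(program, arch_regs):
    """Rename a list of (dest, [sources]) through a mapping table."""
    mapping = {reg: reg for reg in arch_regs}   # identity map to begin with
    fresh = (f"p{n}" for n in itertools.count())
    renamed = []
    for dest, sources in program:
        new_sources = [mapping[s] for s in sources]  # sources go through the map
        new_dest = next(fresh)                       # new pseudo register
        mapping[dest] = new_dest                     # update the map
        renamed.append((new_dest, new_sources))
    return renamed


program = [
    ("r1", ["r1", "r2"]),   # r1 <- M[r1 + r2]
    ("r2", ["r2", "r3"]),   # r2 <- r2 * r3
    ("r4", ["r2", "r5"]),   # r4 <- r2 + r5
    ("r2", ["r0"]),         # r2 <- r0
]
for line in rename(program, ["r0", "r1", "r2", "r3", "r4", "r5"]):
    print(line)
```

After renaming, every destination is unique, so the WAW between the two writes of r2 and the WAR hazards disappear; the RAW from the multiply to the add survives as a read of p1.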
Exceptions and Out-of-order Completion
OOC important when FU (including memory) takes many cycles
- allow independent instructions to flow through other FUs
L1: r1 <- (r2 + A)
    r3 <- (r2 + B)
    r4 <- r1 +F r3
    r2 <- r2 + 8
    r5 <- r5 - 1
    (r2 + C) <- r4
    BNZ r5, L1
MIPS solution:
- 3 independent destinations: Int Reg, HI/LO, FP reg
- Check for possible exceptions before any following inst. modify state (at WB)
  Stall if exception is possible
- Moves from one register space to another are explicit

HW support for More ILP
Speculation: allow an instruction to execute before its branch is resolved, with no consequences if the branch does not go as predicted (HW undo)
Often try to combine with dynamic scheduling
Tomasulo: separate speculative bypassing of results from
real bypassing of results
- When instruction no longer speculative, write results (instruction commit)
- Execute out of order, but commit in order
Need HW buffer for results of uncommitted
instructions: reorder buffer
- Reorder buffer can be operand source
- Once operand commits, result is found in register
- 3 fields: instr. type, destination, value
- Use reorder buffer number instead of reservation station
Reorder Buffers
[Figure: Instruction -> Register Number -> Reorder Buffer and Register File, read in parallel -> Execution Unit]
Keep track of pending updates to register
- in parallel with register file access, do (prioritized) associative lookup in reorder buffer
- hit says register file is old
  - reorder buffer provides new value
  - RB gives FU that new value should be bypassed from
Updates go to reorder buffer
- retired to register file when instruction completes (e.g., in order)
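
A small software model may help fix the idea. This is a sketch under stated assumptions, not a description of any particular machine: the class and method names are hypothetical, entries carry only a destination and a value, and retirement is strictly in program order.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest first
        self.registers = {}      # architectural register file

    def issue(self, dest):
        """Allocate an entry for an instruction that will write dest."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """A functional unit delivers a result, possibly out of order."""
        entry["value"] = value
        entry["done"] = True

    def lookup(self, reg):
        """Newest pending update to reg, or None if the register file is current."""
        for entry in reversed(self.entries):
            if entry["dest"] == reg:
                return entry     # hit: the register file copy is old
        return None

    def commit(self):
        """Retire finished entries from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed


rob = ReorderBuffer()
first = rob.issue("r1")
second = rob.issue("r2")
rob.complete(second, 7)          # younger instruction finishes first
print(rob.commit())              # nothing retires yet
rob.complete(first, 3)
print(rob.commit())              # both retire, oldest first
```

Because nothing reaches the register file until the head entry is done, an exception on the older instruction can discard the younger result without any undo of architectural state.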
Review: Tomasulo Summary
Registers not the bottleneck
Avoids the WAR, WAW hazards of Scoreboard
Not limited to basic blocks (provided branch prediction)
Allows loop unrolling in HW
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
Next stop: More branch prediction
Dynamic Branch Prediction
Performance = f(accuracy, cost of misprediction)
Branch History Table: lower bits of PC address index table of 1-bit values
- says whether or not branch taken last time
Problem: in a loop, 1-bit BHT will cause 2 mispredictions:
1) end of loop case, when it exits instead of looping as before
2) first time through loop on next time through code, when it
predicts exit instead of looping
Solution: 2-bit scheme where change prediction only if get
misprediction twice: (Figure 5.13, p. 284)
[Figure: 4-state diagram with two Predict Taken states and two Predict Not Taken states; arcs labeled T (taken) and NT (not taken)]
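
The loop problem can be reproduced with a short simulation. This is a sketch only: the 2-bit predictor here is a plain saturating counter, which is an assumption and may differ in its state transitions from the diagram in the cited figure, and the loop of 9 taken branches followed by an exit, run 10 times, is made up for illustration.

```python
def mispredictions_1bit(outcomes, last_taken=True):
    """Predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# A loop branch taken 9 times and then falling through, executed 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print("1-bit:", mispredictions_1bit(loop_branch))
print("2-bit:", mispredictions_2bit(loop_branch))
```

After the first pass the 1-bit scheme misses twice per execution of the loop (the exit and the re-entry), while the 2-bit counter misses only on the exit.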
BHT Accuracy
Mispredict because either:
- wrong guess for that branch
- got branch history of wrong branch when indexing the table
4096-entry table: programs vary from 1% misprediction
(nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc
at 12%
4096 about as good as infinite table, but 4096 is a lot of
HW
Correlating Branches
Idea: taken/not taken of recently executed branches is
related to behavior of next branch (as well as the history
of that branch's behavior)
Then behavior of recent branches selects between, say, 4
predictions of next branch, updating just that prediction
[Figure: branch address indexes a table of 2-bit per-branch predictors; a 2-bit global branch history selects which one supplies the prediction]
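
A (2,2) predictor of this kind can be sketched in a few lines. The class below is a simplified model built on assumptions: the table is indexed with `pc % entries` rather than specific PC bits, all counters start at 1 (weakly not taken), and the strictly alternating branch used to exercise it is an invented test pattern.

```python
class CorrelatingPredictor:
    """2 bits of global history choose among 4 two-bit counters per entry."""

    def __init__(self, entries=1024, history_bits=2):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        self.counters = [[1] * (1 << history_bits) for _ in range(entries)]

    def predict(self, pc):
        return self.counters[pc % self.entries][self.history] >= 2

    def update(self, pc, taken):
        row = self.counters[pc % self.entries]
        count = row[self.history]
        # Update only the counter selected by the current history.
        row[self.history] = min(count + 1, 3) if taken else max(count - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 50
print(count_mispredictions(CorrelatingPredictor(), 0x40, alternating))
```

A single 2-bit counter cannot follow a strictly alternating branch, but here the history selects a different counter before each taken and each not-taken outcome, so the predictor settles after a couple of misses.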
Accuracy of Different Schemes: Mispredictions
[Figure: bar chart of frequency of mispredictions (axis 0% to 18%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li. Three series: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2). Surviving data labels: 0%, 1%, 5%, 6%, 6%, 11%, 4%, 6%, 5%, 1%]
Getting CPI < 1: Issuing Multiple Instr/Cycle
2 variations:
Superscalar: varying no. instructions/cycle (1 to 8),
scheduled by compiler or by HW (Tomasulo)
IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100
Very Long Instruction Words (VLIW): fixed number of
instructions (16) scheduled by the compiler
Joint HP/Intel agreement in 1997 (P86?)?
Easy Superscalar
[Figure: I-Cache feeds Inst Issue and Bypass; Int Reg and FP Reg files; Int Unit, Load/Store Unit, FP Add, FP Mul; D-Cache]
Issue integer and FP operations in parallel!
- potential hazards?
- expected speedup?
- what combinations of instructions make sense?
Getting CPI < 1: Issuing Multiple Instr/Cycle
Superscalar: 2 instructions, 1 FP & 1 anything else
=> Fetch 64-bits/clock cycle; Int on left, FP on right
=> Can only issue 2nd instruction if 1st instruction issues
=> More ports for FP registers to do FP load & FP op in a pair
Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB
1 cycle load delay expands to 3 instructions in SS
- instruction in right half can't use it, nor instructions in next slot

Unrolled Loop that minimizes stalls for scalar
1 Loop: LD F0,0(R1)
2 LD F6,-8(R1)
3 LD F10,-16(R1)
4 LD F14,-24(R1)
5 ADDD F4,F0,F2
6 ADDD F8,F6,F2
7 ADDD F12,F10,F2
8 ADDD F16,F14,F2
9 SD 0(R1),F4
10 SD -8(R1),F8
11 SD -16(R1),F12
12 SUBI R1,R1,#32
13 BNEZ R1,LOOP
14 SD 8(R1),F16 ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Loop Unrolling in SuperScalar
        Integer instruction   FP instruction     Clock cycle
Loop:   LD   F0,0(R1)                            1
        LD   F6,-8(R1)                           2
        LD   F10,-16(R1)      ADDD F4,F0,F2      3
        LD   F14,-24(R1)      ADDD F8,F6,F2      4
        LD   F18,-32(R1)      ADDD F12,F10,F2    5
        SD   0(R1),F4         ADDD F16,F14,F2    6
        SD   -8(R1),F8        ADDD F20,F18,F2    7
        SD   -16(R1),F12                         8
        SD   -24(R1),F16                         9
        SUBI R1,R1,#40                           10
        BNEZ R1,LOOP                             11
        SD   8(R1),F20                           12   ; 8-40 = -32
Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration

Limits of SuperScalar
While Integer/FP split is simple for the HW, get CPI of 0.5
only for programs with:
- Exactly 50% FP operations
- No hazards
If more instructions issue at same time, greater difficulty
of decode and issue
- Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
VLIW: tradeoff instruction space for simple decoding
- The long instruction word has room for many operations
- By definition, all the operations the compiler puts in the long instruction word can execute in parallel
- E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
  16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
- Need compiling technique that schedules across several branches
Loop Unrolling in VLIW
Memory            Memory            FP                FP                Int. op/          Clock
reference 1       reference 2       operation 1       op. 2             branch
LD F0,0(R1)       LD F6,-8(R1)                                                            1
LD F10,-16(R1)    LD F14,-24(R1)                                                          2
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2     ADDD F8,F6,F2                       3
LD F26,-48(R1)                      ADDD F12,F10,F2   ADDD F16,F14,F2                     4
                                    ADDD F20,F18,F2   ADDD F24,F22,F2                     5
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                       6
SD -16(R1),F12    SD -24(R1),F16                                                          7
SD -32(R1),F20    SD -40(R1),F24                                        SUBI R1,R1,#56    8
SD 8(R1),F28                                                            BNEZ R1,LOOP      9   ; 8-56 = -48

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW
What happens with next generation? Will old code work?
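
The per-iteration figures quoted on the last three slides follow from dividing the clock count of each schedule by the number of original iterations it covers. A quick check, using only the numbers given on the slides (the dictionary keys and function name are illustrative):

```python
# (total clocks, original iterations covered), as given on the slides.
schedules = {
    "scalar, unrolled 4 times": (14, 4),
    "2-issue superscalar, unrolled 5 times": (12, 5),
    "VLIW, unrolled 7 times": (9, 7),
}


def clocks_per_iteration(clocks: int, iterations: int) -> float:
    return clocks / iterations


for name, (clocks, iterations) in schedules.items():
    print(f"{name}: {clocks_per_iteration(clocks, iterations):.1f} clocks per iteration")
```

Each step needs a larger unroll factor and more registers to reach its lower clock count per iteration.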
Limits to Multi-Issue Machines
Inherent limitations of ILP
- 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
- Latencies of units => many operations must be scheduled
- Need about Pipeline Depth x No. Functional Units of independent operations
Difficulties in building HW
- Duplicate FUs to get parallel execution
- Increase ports to Register File (VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg)
- Increase ports to memory
- Decoding SS and impact on clock rate, pipeline depth
Limitations specific to either SS or VLIW implementation
- Decode issue in SS
- VLIW code size: unroll loops + wasted fields in VLIW
- VLIW lock step => 1 hazard & all instructions stall
- VLIW & binary compatibility
Exploring Limits to ILP
Conflicting studies of amount of parallelism available, depending on:
- Benchmarks (vectorized Fortran FP vs. integer C programs)
- Hardware sophistication
- Compiler sophistication
Initial HW Model here; MIPS compilers
1. Register renaming: infinite virtual registers and all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided the addresses are not equal
1 cycle latency for all instructions
Upper Limit to ILP
Instruction issues per cycle (chart axis 0 to 160):
Program     Instruction issues per cycle
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doducd      118.7
tomcatv     150.1
More Realistic HW: Branch Impact
Change from Infinite window to examine to 2000 and
maximum issue of 64 instructions per clock cycle
Instruction issues per cycle (chart axis 0 to 60), by branch predictor:
Program     Perfect   Selective predictor   Standard 2-bit   Static      None
                      (pick Cor. or BHT)    (BHT, 512)       (Profile)
gcc         35        9                     6                6           2
espresso    41        12                    7                6           2
li          16        10                    6                7           2
fpppp       61        48                    46               45          29
doducd      58        15                    13               14          4
tomcatv     60        46                    45               45          19
More Realistic HW: Register Impact
Change 2000 instr window, 64 instr issue, 8K 2-level
Prediction
Instruction issues per cycle (chart axis 0 to 60), by number of renaming registers:
Program     Infinite   256   128   64   32   None
gcc         11         10    10    9    5    4
espresso    15         15    13    10   5    4
li          12         12    12    11   6    5
fpppp       59         49    35    20   5    4
doducd      29         16    15    11   5    5
tomcatv     54         45    44    28   7    5
More Realistic HW: Alias Impact
Change 2000 instr window, 64 instr issue, 8K 2-level
Prediction, 256 renaming registers
Instruction issues per cycle (chart axis 0 to 50), by alias analysis:
Program     Perfect   Global/stack perfect;   Inspection   None
                      heap conflicts          (assem.)
gcc         10        7                       4            3
espresso    15        7                       5            5
li          12        9                       4            3
fpppp       49        49                      4            3
doducd      16        16                      6            4
tomcatv     45        45                      5            4
Realistic HW for 9X: Issue Window Impact
Perfect disambiguation (HW), 1K Selective Prediction,
16 entry return, 64 registers, issue as many as window
Instruction issues per cycle (chart axis 0 to 60), by window size:
Program     Infinite   256   128   64   32   16   8   4
gcc         10         10    10    9    8    6    4   3
espresso    15         15    13    10   8    6    4   2
li          12         12    11    11   9    6    4   3
fpppp       52         47    35    22   14   8    5   3
doducd      17         16    15    12   9    7    4   3
tomcatv     56         45    34    22   14   9    6   3
Brainiac vs. Speed Demon (1994)
8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe)
vs. 2-scalar DEC Alpha @ 200 MHz (7 stage pipe)
[Figure: SPECMarks (axis 0 to 900), IBM vs. DEC, for espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]
HW support for More ILP
Avoid branch prediction by turning branches into
conditionally executed instructions:
if (x) then A = B op C else NOP
- If false, then neither store result nor cause exception
- Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
Drawbacks to conditional instructions
- Still takes a clock even if annulled
- Stall if condition evaluated late
- Complex conditions are hard to express as a conditional operation
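
The transformation can be sketched at source level. This is only an analogy, assuming `op` is an addition that cannot raise an exception; the function names are hypothetical, and the select expression stands in for a conditional-move instruction.

```python
def with_branch(x, a, b, c):
    """if (x) then A = B op C else NOP, written with a branch."""
    if x:
        a = b + c
    return a


def with_conditional_move(x, a, b, c):
    """Same result with no control transfer around the operation."""
    t = b + c               # always executed: still takes a clock
    return t if x else a    # stands in for a conditional move on x


for x in (True, False):
    print(x, with_branch(x, 1, 2, 3), with_conditional_move(x, 1, 2, 3))
```

The unconditional computation of `t` shows the first drawback on the slide: the work is done even when the result is thrown away.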
Summary
Instruction Level Parallelism in SW or HW
Loop level parallelism is easiest to see
SW dependencies/Compiler sophistication determine if compiler
can unroll loops
SW Pipelining
Symbolic Loop Unrolling to get most from pipeline with little code
expansion, little overhead
HW unrolling
Scoreboard & Tomasulo=> Register renaming, reorder
Branch Prediction
Branch History Table: 2 bits for loop accuracy
Correlation: Recently executed branches correlated with next branch
SuperScalar and VLIW
CPI < 1
Dynamic issue vs. Static issue
More instructions issue at same time, larger the penalty of hazards
Future? Stay tuned
To probe further
Links to corporate home pages and press releases:
http://infopad.eecs.berkeley.edu/CIC/
