Itanium: An EPIC Architecture
Marco Barcella
Karthik Sankaranarayanan
Ganesh Pai
ITANIUM
Introduction
EPIC: Explicitly Parallel Instruction Computing
Combination of features of RISC and VLIW
VLIW features and flaws
Introduction
733 - 800 MHz clock
0.18-micron CMOS process technology
2 extended, 2 single precision FMACs
Executes up to 8 SP flops/cycle (about 6 GFLOPS peak)
>20x Pentium Pro
3-level cache hierarchy
Split L1 and Unified L2 on die
Unified L3 on a separate die but in the same container
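The peak figure above follows from simple arithmetic: flops per cycle times clock rate. A minimal sketch, assuming the 8 SP flops/cycle and the 733 to 800 MHz range from this slide; the function name `peak_gflops` is hypothetical.

```python
def peak_gflops(flops_per_cycle, clock_mhz):
    """Peak rate in GFLOPS = flops per cycle x clock in GHz."""
    return flops_per_cycle * clock_mhz / 1000.0


low = peak_gflops(8, 733)    # slowest clock quoted on the slide
high = peak_gflops(8, 800)   # fastest clock quoted on the slide
print(f"{low:.2f} to {high:.2f} GFLOPS")
```

This is a peak number only; sustained throughput depends on how well the compiler fills the FMAC slots.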
Introduction
Die Plot
Marco Barcella
Outline
Introduction to the ISA
Expressing parallelism
Creating parallelism
Techniques and instructions
Compatibility
Observations
Techniques
Predication
Speculation
Large register files
register rotation
HW exception deferral
Software pipelining
Register Resources
NaT
Predicate Registers
Branch Registers
AR: PFS(PFM, PEC)
UM
8 Kernel Registers
LC, EC, CCV
AR 16-19
Future definition
Encoding
Instructions
6 types, 4 units
L+X: long branches, long-immediate integer
Expressing Parallelism
Not only bundles, but also
- Compound Conditionals
if ((a == 0) || (b <= 5) ||
    (c != d) || (f & 0x2))
{ r3 = 8; }
    = 0,a
    = 0,t
    = 0,d
    1,f,1 ;;
    (p1) mov r3 = 8
- Multi-way branches
{ .mii
    cmp.ne p1,p2 = r1,r2
    cmp.ne p3,p4 = 4, r5
    cmp.lt p5,p6 = r8,r9
}
{ .bbb
    (p1) br.cond label1
    (p3) br.cond label2
    (p5) br.call b4 = label3
}
// Fall through code here
Creating Parallelism
Predication
Uses CMP instructions and predicate registers
Converts control dependencies to data dependencies
Motivation
if (r1==r2)
r9 = r10 - r11;
else
r5 = r6 + r7;
Speculation + Predication
Basic blocks in a single group
Barriers between basic blocks
Compiler
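The if/else in the Motivation example can be if-converted so that both arms issue and a predicate decides which result is kept. A minimal Python sketch of that idea, assuming the `r9` arm is a subtraction; the function names `branchy` and `predicated` and the dictionary register file are hypothetical.

```python
def branchy(regs):
    """Original control flow: only one arm executes."""
    r = dict(regs)
    if r["r1"] == r["r2"]:
        r["r9"] = r["r10"] - r["r11"]
    else:
        r["r5"] = r["r6"] + r["r7"]
    return r


def predicated(regs):
    """If-converted form: no branch, two complementary predicates."""
    r = dict(regs)
    # cmp.eq p1,p2 = r1,r2 writes a predicate and its complement.
    p1 = r["r1"] == r["r2"]
    p2 = not p1
    # Both operations are computed; a false predicate turns the write off.
    sub_result = r["r10"] - r["r11"]
    add_result = r["r6"] + r["r7"]
    if p1:
        r["r9"] = sub_result      # (p1) sub r9 = r10,r11
    if p2:
        r["r5"] = add_result      # (p2) add r5 = r6,r7
    return r
```

The control dependence on the branch has become a data dependence on `p1` and `p2`, so the two operations can sit in the same instruction group.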
Control Speculation
Importance of loads
ld.s and chk.s and handling exceptions
Propagation of token and fix-up
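The token propagation and fix-up described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the hardware behaviour in full: memory is a Python dictionary, a missing key stands in for a faulting access, and the names `Reg`, `ld_s`, `add` and `chk_s` are hypothetical helpers named after the instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False   # deferred-exception token (NaT bit)


def ld_s(memory, addr):
    """Speculative load: a faulting access sets NaT instead of trapping."""
    if addr not in memory:
        return Reg(0, nat=True)
    return Reg(memory[addr])


def add(a, b):
    """Consumers propagate the token from any source operand."""
    return Reg(a.value + b.value, nat=a.nat or b.nat)


def chk_s(reg, recovery):
    """Check at the original location of the load: run fix-up code if NaT."""
    return recovery() if reg.nat else reg


memory = {0x100: 7}
good = chk_s(add(ld_s(memory, 0x100), Reg(1)), recovery=lambda: Reg(-1))
bad = add(ld_s(memory, 0x200), Reg(1))
```

Here `good` holds the speculated result, while `bad` carries the token until a `chk_s` redirects to recovery code.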
Data Speculation
Ambiguous dependencies, ld.a
How it works
ALAT, two tags
Two recoveries
ld.c, ldf.c, ldfp.c
chk.a (chk.s)
Procedure Calls
RSE speculatively fills and spills in the
background
Result: Vs. PA-RISC 30%, 5% (Database)
Branch Instruction
Three categories
IP-relative (21-bit); long (60-bit); indirect (in BRs)
Branch Instructions
Software Pipelining
Motivation
Vs HW
Parallelism
3 phases
Rotating FR, PR
LC, EC
Software Pipelining
2 categories: counted and while (top, exit)
Counted: ends with EC=1 and LC=0, no qualifying predicate
While: no LC, ends when QP=0 and EC=1
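The counted-loop termination rule (LC=0 and EC=1) can be traced with a small model of the loop branch. This is a simplified sketch: the rotating predicates are represented as a Python list that shifts by one on every taken branch, LC is assumed to be initialised to the trip count minus one and EC to the number of pipeline stages, and the function name `run_counted_loop` is hypothetical.

```python
def run_counted_loop(n_iterations, n_stages):
    """Return the stage predicates seen on each pass through the kernel."""
    lc = n_iterations - 1                  # LC: new iterations still to start
    ec = n_stages                          # EC: passes left, including the drain
    preds = [1] + [0] * (n_stages - 1)     # stage 1 enabled on entry
    trace = []
    while True:
        trace.append(tuple(preds))
        if lc > 0:                         # prologue/kernel: start an iteration
            lc -= 1
            preds = [1] + preds[:-1]       # rotate in a true stage predicate
        elif ec > 1:                       # epilogue: drain, start nothing new
            ec -= 1
            preds = [0] + preds[:-1]       # rotate in a false stage predicate
        else:                              # LC == 0 and EC == 1: loop ends
            return trace


for stage_predicates in run_counted_loop(n_iterations=3, n_stages=2):
    print(stage_predicates)
```

With 3 iterations and 2 stages the kernel runs 4 times (iterations plus stages minus one), covering the fill, steady-state and drain phases with one copy of the loop body.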
Deallocate
Location, Target, Importance, Strategy
Memory Instructions
Simple (GR or FR, memory access order)
Variants for speculative, spilling
Semaphore instructions
Compare Instructions
Two predicate registers
Deferred token (tnat)
5 types: normal, unconditional, and 3 parallel compares
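The or-type parallel compare is what lets several compares target the same predicate in one instruction group: each compare either sets the target or leaves it alone, so the order in which they write does not matter. A rough Python sketch of that behaviour; `cmp_or` and `compound` are hypothetical names, and the condition mirrors the compound conditional shown under Expressing Parallelism.

```python
def cmp_or(pred, condition):
    """or-type compare: set the target if the condition holds, else leave it."""
    return True if condition else pred


def compound(a, b, c, d, f):
    """Evaluate (a==0) || (b<=5) || (c!=d) || (f & 0x2) into one predicate."""
    p1 = False                           # predicate cleared before the group
    # The four compares are independent of each other, so hardware can
    # evaluate them in the same cycle; here they simply run in sequence.
    p1 = cmp_or(p1, a == 0)
    p1 = cmp_or(p1, b <= 5)
    p1 = cmp_or(p1, c != d)
    p1 = cmp_or(p1, (f & 0x2) != 0)
    return 8 if p1 else None             # (p1) mov r3 = 8


print(compound(1, 9, 1, 1, 2))
```

A chain of short-circuit branches is replaced by one group of compares and a single predicated move.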
Compatibility
X86: direct execution
BR.IA, JMPE, overhead of register set saving
SSE included (128), new media
MMX parallel arithmetic: 128 not 8
HP dynamic translator
CMP4
Code Density
Causes
Avg. 43 bits per instruction (32 for RISC)
Added (alloc, chk)
Fix-up
Biggest impact
Decreasing hit rate on caches
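The 43-bit average comes from the bundle format: a 128-bit bundle carries a 5-bit template and three 41-bit instruction slots. A minimal sketch of that layout in Python; the helper names `pack_bundle` and `unpack_bundle` are hypothetical, and the slot contents are treated as opaque integers.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS   # 128


def pack_bundle(template, slots):
    """Pack a template and three slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    word = template                            # template in the low 5 bits
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask for i in range(3)]
    return template, slots


bits_per_instruction = BUNDLE_BITS / 3         # roughly 42.7
```

Every instruction pays a third of the template on top of its 41 bits, which is where the code-size gap against a 32-bit RISC encoding starts.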
Observations
Synergetic
ld.sa, data dependences in software pipelining
Compiler
Template
Grouping
Explicit prefetching
ld.a
Instruction Stream
The Processor Front-end
Ganesh Pai
Instruction Stream
Overview of EPIC hardware
I-Stream
Pipeline
I-Cache
Prefetch & Fetch
Branch prediction
Issue (Instruction dispersal & delivery)
Pipeline Features
6-wide EPIC hardware under precise compiler control
10-stage in-order pipeline
Dynamic support for run-time optimization
Ensure high throughput
I Cache ; I TLB
16 KB
4-way set associative
Fully pipelined
64-entry I-TLB
Single cycle
Fully associative
On-chip page walker
Decoupling buffer
8 bundles deep
Hides stalls, cache misses, branch mispredictions
Branch Prediction
First emphasis on compiler
Reducing branches by predication
Branch Prediction
Resteer1 : Single Cycle Predictor
4 TARs programmed by compiler with important hints
TAR is a 4 deep FIFO
On a hit branch is predicted taken
Branch Prediction
Resteer3 & 4
Two branch address calculators (BAC1 and BAC2)
Correction to earlier predictions (if any)
A special perfect-exit-loop-predictor
Instruction Dispersal
Stop bits eliminate dependency checking
Templates simplify routing
Map instructions to first available of 9 issue ports
Keep issuing until stop bit
Resource over-subscription or asymmetry
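The dispersal rules above can be sketched as a loop that hands each instruction to the first free port of its type and stops at a stop bit or when a port type is over-subscribed. This is a simplified model under stated assumptions: at most six instructions (two bundles) are considered per cycle, the port counts are 2 M, 2 I, 2 F and 3 B, and template-specific routing and port asymmetry are ignored. The names `PORTS` and `disperse` are hypothetical.

```python
PORTS = {"M": 2, "I": 2, "F": 2, "B": 3}    # 9 issue ports in total


def disperse(instructions):
    """instructions: list of (unit_type, stop_bit) in program order.

    Returns how many of them issue in this cycle.
    """
    free = dict(PORTS)
    issued = 0
    for unit, stop in instructions[:6]:      # at most two bundles per cycle
        if free[unit] == 0:
            break                            # over-subscription: split issue
        free[unit] -= 1
        issued += 1
        if stop:
            break                            # stop bit ends the group
    return issued


print(disperse([("M", False), ("I", False), ("I", True), ("M", False)]))
```

No register comparison appears anywhere in the loop: the stop bits already tell the hardware where dependences begin.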
Instruction Delivery
Register Stacking
Achieved transparently to the compiler
Register re-mapping via parallel adders
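Register re-mapping amounts to adding a frame base to the virtual register number, with wrap-around inside the stacked region. A minimal sketch, assuming 32 static and 96 stacked general registers; the parameter `bof` (a bottom-of-frame offset) and the function name `physical_register` are hypothetical, and register rotation is left out.

```python
NUM_STATIC = 32     # r0-r31 are not renamed
NUM_STACKED = 96    # r32-r127 are renamed per procedure frame


def physical_register(virtual, bof):
    """Map a virtual register number to a physical one for a given frame base."""
    if not 0 <= virtual < NUM_STATIC + NUM_STACKED:
        raise ValueError("virtual register out of range")
    if virtual < NUM_STATIC:
        return virtual
    # One addition and a wrap: this is the work a per-operand adder performs.
    return NUM_STATIC + (bof + virtual - NUM_STATIC) % NUM_STACKED


# A caller with 5 local registers (r32-r36) passes its first output in r37.
caller_bof, size_of_locals = 0, 5
callee_bof = (caller_bof + size_of_locals) % NUM_STACKED
print(physical_register(37, caller_bof), physical_register(32, callee_bof))
```

The caller's first output register and the callee's `r32` land on the same physical register, so arguments are passed without any copying.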
Data Stream
The Execution Core
Karthik Sankaranarayanan
4 ALU, 4 MMX, 2 + 2 FMAC, 2 Load/Store, 3 branch
Issue Ports: 2 I, 2 M, 2 F, 3 B
Register Files
Integer
128 64-bit
8 read ports (2 x 2 I units, 2 x 2 M units)
6 write ports (1 x 2 I units, 2 x 2 Loads - A.I)
Floating Point
128 82-bit (double extended)
8 read ports (2 x 2 F units, 2 x 2 M units)
4 write ports (1 x 2 F units, 1 x 2 M units)
Predicate
64 1-bit , broadside R/W
15 read ports (2 x 6 - M, F, I units & 3B units)
11 write ports
(2 x 2 M units, 2 x 2 I units, 2 x 1 F unit, 1 x 1 Reg. Rot.)
Register Scoreboard
Hazard detection
Stall only dependent instructions
Include predicates
cmp.eq r1,r2 --> p1,p3
(p1) ld4[r3] --> r4
add r4, r1 --> r5 (no dependence if p1=0)
Defer stalls
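Including predicates in the scoreboard means a pending producer only blocks a consumer when the producer is actually going to write. A toy decision function illustrating that rule; the name `must_stall` and its parameters are hypothetical, and treating a nullified consumer as stall-free is an assumption of this sketch rather than something the slide states.

```python
def must_stall(producer_pred, result_ready, consumer_pred=True):
    """Decide whether a consumer has to wait for a pending producer."""
    if not producer_pred:
        return False          # producer nullified: no dependence (p1 = 0 case)
    if not consumer_pred:
        return False          # assumption: a nullified consumer needs no operand
    return not result_ready   # live dependence: wait until the value arrives


# (p1) ld4 [r3] --> r4 still outstanding, followed by add r4, r1 --> r5
print(must_stall(producer_pred=False, result_ready=False))
print(must_stall(producer_pred=True, result_ready=False))
```

With `p1 = 0` the add proceeds even though the load has not returned, which is the case the slide's example calls out.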
Operand Delivery
Deferred Stall
Execution
Deferred Stall
Execute
Writes turned off at retirement for false predicates
Different latencies - Out Of Order Execution
In-order retire - scoreboard
cmp.eq r1,r2 --> p1,p3
cmp.eq r7,r8 --> p5,p7
(p1) ld4[r3] --> r4 (reads p1 in EXE)
(p5) add r4, r1 --> r5 (reads p5 in REG)
Predicates
Producer reads in EXE
Consumer reads in REG
Execution
Predicates
Forward as soon as possible
Minimize forwarding logic
Predicate generation - deterministic latency
ld.s, chk.s
Exception Deferral - NaTs, NaTVals (poison bits!)
Store NaTs? - store.spill, ld.fill (context switch)
UNaT, RNaT
Data Speculation
ld.a, chk.a, ld.c
ld.c can be issued with dependent instructions
ALAT - 32 entries, Register ID, Address, Size
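The ALAT fields listed above (register ID, address, size) are enough to sketch how data speculation is checked. The model below is an illustration under stated assumptions: it compares full addresses, evicts the oldest entry when full (the real replacement policy is not given here), and uses hypothetical method names that mirror the instructions.

```python
class ALAT:
    """Toy advanced load address table keyed by target register."""

    def __init__(self, entries=32):
        self.entries = entries
        self.table = {}                     # register id -> (address, size)

    def ld_a(self, reg, addr, size):
        """Advanced load: allocate an entry tagged with the target register."""
        self.table.pop(reg, None)
        if len(self.table) >= self.entries:
            oldest = next(iter(self.table)) # assumption: evict the oldest entry
            del self.table[oldest]
        self.table[reg] = (addr, size)

    def store(self, addr, size):
        """A store removes every entry whose byte range it overlaps."""
        self.table = {
            reg: (a, s)
            for reg, (a, s) in self.table.items()
            if a + s <= addr or addr + size <= a
        }

    def check(self, reg):
        """ld.c / chk.a: True if the entry survived, False if recovery is needed."""
        return reg in self.table


alat = ALAT()
alat.ld_a(reg=4, addr=0x100, size=4)        # load hoisted above the stores
alat.store(addr=0x200, size=4)              # unrelated store: entry survives
print(alat.check(4))
alat.store(addr=0x102, size=2)              # overlapping store: entry removed
print(alat.check(4))
```

A surviving entry lets `ld.c` act as a no-op; a missing entry means the load is redone (`ld.c`) or recovery code is entered (`chk.a`).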
FPU Details
Memory Subsystem
Address translation
32-entry L1 DTLB, 96-entry L2 DTLB, page size 4 KB - 256 MB
Regions for sharing, keys for protection
Hardware page walker
Memory Subsystem
L1 Data
16 KB, 4-way, 32-byte lines
write through, no write allocate
dual ported, 2 cycle load latency
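The L1 data cache geometry fixes how an address splits into tag, set index and line offset. A short worked example, assuming "16 K" means 16 KB of data and ignoring the virtual versus physical addressing question; `split_address` is a hypothetical helper.

```python
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 32
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)    # 128 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 5 bits select a byte in the line
INDEX_BITS = SETS.bit_length() - 1           # 7 bits select the set


def split_address(addr):
    """Return (tag, set index, line offset) for a byte address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(split_address(0x12345))
```

Only 12 address bits are needed to pick the set and byte, which equals the offset within a 4 KB page.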
Memory Subsystem
Caches
Hints
FP NT1 = Int NT2
Bias - Easier MESI
64-bit, 2.1 GB/s
Multidrop, split-transaction bus
Up to 56 outstanding transactions
Optimized MESI protocol
Glue-less multiprocessor support (Up to 4)
IA 32 control
ECC/Parity coverage of processor and bus
Read only structures - parity
Data - ECC.
Conclusions
To Sum Up