Itanium: An EPIC Architecture

The document provides an overview of the Itanium processor architecture, which uses Explicitly Parallel Instruction Computing (EPIC). It discusses key aspects of the Itanium including its 733-800 MHz clock speed, 64-bit computing capabilities, 3-level cache hierarchy, and instruction set architecture. The document is divided into sections on the instruction stream, data stream, and IA-32 compatibility features of the Itanium.

ITANIUM

An EPIC Architecture
Marco Barcella
Karthik Sankaranarayanan
Ganesh Pai

Introduction
EPIC: Explicitly Parallel Instruction Computing
Combination of features of RISC and VLIW
VLIW features and flaws

Groups of independent instructions

Simple hardware
Exploit ILP with compiler
Large increase in code size
Blocking caches

Introduction
733 - 800 MHz clock
0.18-micron CMOS process technology
2 extended, 2 single precision FMACs
Execution up to 8 SP flops/cycle - 6 GFLOP
>20x Pentium Pro
3-level cache hierarchy
Split L1 and Unified L2 on die
Unified L3 on separate die but same container

Introduction

64-byte line size

Page Sizes up to 256MB
Full 64-Bit computing
Full IA-32 binary compatibility in hardware
Shared Resources: ALU, registers, Data Cache
IA-32 Engine: Dynamic execution

Instruction set architecture (Marco)

Instruction stream (Ganesh)
Data stream and IA-32 Compatibility (Karthik)

Die Plot

Instruction Set Architecture

The Software Interface

Marco Barcella

Outline
Introduction to the ISA
Expressing parallelism
Creating parallelism
Techniques and instructions
Compatibility
Observations

Why & How

Goal
Bring ILP features to a general purpose microprocessor, flexibility

Techniques

Predication
Speculation
Large register files
register rotation
HW exception deferral
Software pipelining

RISC/CISC basic architecture of HP's PA-RISC, but

UM
8 Kernel Registers
LC, EC, CCV
AR 16-19
Future definition

Encoding

Bundles: More than one per cycle

Template: MII, MIB other combinations

Compiler based reordering
No Register analysis
Instruction compared to 32-bit
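
The bundle encoding on this slide can be sketched as follows. This is a minimal illustration assuming the IA-64 bundle layout (128 bits: a 5-bit template in the low bits, then three 41-bit instruction slots); the helper names `pack_bundle` and `unpack_bundle` and the sample values are hypothetical.

```python
BUNDLE_BITS = 128
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a template and three slot values into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = []
    for i in range(SLOTS_PER_BUNDLE):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        slots.append((bundle >> shift) & ((1 << SLOT_BITS) - 1))
    return template, slots


# Average encoding cost per instruction, to compare with a 32-bit RISC word.
bits_per_instruction = BUNDLE_BITS / SLOTS_PER_BUNDLE
```

Dividing the 128 bundle bits over three instructions gives roughly 42.7 bits each, which is the basis for the comparison with 32-bit instructions.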

Instructions
6 types, 4 units
L+ X : Long branches, long immediate integer

Expressing Parallelism
Not only bundles, but also
- Compound Conditionals
if ((a==0) || (b<=5) ||
    (c!=d) || (f & 0x2))
{ r3 = 8; }

cmp.ne p1 = r0, r0;
add t = -5, b;;
cmp.eq.or p1 = 0,a
cmp.ge.or p1 = 0,t
cmp.ne.or p1 = c,d
tbit.or p1 = 1,f,1;;
(p1) mov r3 = 8

- Multi-way branches
{ .mii
cmp.ne p1,p2 = r1,r2;
cmp.ne p3,p4 = 4, r5;
cmp.lt p5,p6 = r8,r9;
}
{ .bbb
(p1) br.cond label1
(p3) br.cond label2
(p5) br.call b4 = label3
}
// Fall through code here
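
The compound-conditional idea can be modelled in a few lines. This is a rough sketch, not the hardware definition: it assumes that an `.or` compare only ever sets its target predicate and never clears it, which is why the four tests can share one instruction group. The function name `compound_or` is hypothetical.

```python
def compound_or(a, b, c, d, f):
    """Model of: if (a==0 || b<=5 || c!=d || f & 0x2) r3 = 8."""
    p1 = False   # cmp.ne p1 = r0, r0  (r0 != r0 is always false)
    t = b - 5    # add t = -5, b

    # Parallel compares: each one may set p1 but never clears it,
    # so the result does not depend on the order of evaluation.
    conditions = [
        a == 0,           # cmp.eq.or
        0 >= t,           # cmp.ge.or  (equivalent to b <= 5)
        c != d,           # cmp.ne.or
        bool(f & 0x2),    # tbit.or    (tests bit 1 of f)
    ]
    for holds in conditions:
        if holds:
            p1 = True

    # (p1) mov r3 = 8 ; None stands for "r3 left unchanged"
    return 8 if p1 else None
```

For example, `compound_or(1, 9, 1, 1, 2)` returns 8 because only the bit test holds, while `compound_or(1, 9, 1, 1, 1)` returns `None`.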

Creating Parallelism
Predication
Uses CMP instructions and predicate registers
Converts control dependencies to data dependencies
Motivation
if (r1==r2)
r9 = r10 - r11;
else
r5 = r6 + r7;

cmp.eq p1,p2 = r1, r2;;

(p1) sub r9 = r10, r11
(p2) add r5 = r6, r7

Speculation + Predication
Basic blocks in a single group
Barriers between basic blocks
Compiler
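
A small model of the if-conversion above may help. It is only a sketch: both arms are computed, and the predicates decide which result is written back. The function names `branchy` and `predicated` are hypothetical, and the register values are passed in as plain integers.

```python
def branchy(r1, r2, r5, r6, r7, r9, r10, r11):
    """Original control flow: only one arm executes."""
    if r1 == r2:
        r9 = r10 - r11
    else:
        r5 = r6 + r7
    return r5, r9


def predicated(r1, r2, r5, r6, r7, r9, r10, r11):
    """If-converted form: no branch, both arms issue."""
    # cmp.eq p1,p2 = r1,r2 writes a predicate and its complement.
    p1 = r1 == r2
    p2 = not p1

    # Both operations execute; a false predicate suppresses the write.
    sub_result = r10 - r11
    add_result = r6 + r7
    r9 = sub_result if p1 else r9   # (p1) sub r9 = r10, r11
    r5 = add_result if p2 else r5   # (p2) add r5 = r6, r7
    return r5, r9
```

The control dependence on the branch has become a data dependence on `p1` and `p2`, so the two arms can sit in the same instruction group.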

Control Speculation
Importance of loads
ld.s and chk.s and handling exceptions
Propagation of token and fix-up
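
The token propagation and fix-up described here can be sketched roughly as below. Assumptions: memory is a dictionary, a missing key stands in for a faulting address, and the names `ld_s`, `add`, `chk_s` and `run_speculated` are illustrative stand-ins for the instructions, not a real API.

```python
class _NaT:
    """Deferred-exception token ("Not a Thing")."""
    def __repr__(self):
        return "NaT"


NAT = _NaT()


def ld_s(memory, addr):
    """Speculative load: a fault is deferred by returning the token."""
    return memory.get(addr, NAT)


def add(x, y):
    """Uses of a token propagate the token instead of raising."""
    return NAT if x is NAT or y is NAT else x + y


def chk_s(value, fixup):
    """Check: run the fix-up code only if the token reached this point."""
    return fixup() if value is NAT else value


def run_speculated(memory, ptr, result_needed):
    v = ld_s(memory, ptr)   # load hoisted above the branch
    w = add(v, 1)           # dependent use, also speculated
    if not result_needed:
        return None         # deferred fault is silently dropped

    def fixup():
        # Non-speculative re-execution; a real fault surfaces here.
        return memory[ptr] + 1

    return chk_s(w, fixup)
```

If the guarded path is never taken, the faulting load costs nothing; if it is taken, the fix-up code repeats the load non-speculatively and the exception is raised at the right place.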

Data Speculation
Ambiguous dependencies, ld.a
How it works
ALAT, two tags

Two recoveries
ld.c, ldf.c, ldfp.c
chk.a (chk.s)
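
A toy model of the ALAT mechanism, under simplifying assumptions: entries are keyed by target register and record an address and size, any overlapping store removes the entry, and capacity limits are ignored. The class and method names are hypothetical.

```python
class ALAT:
    """Simplified advanced load address table."""

    def __init__(self):
        self.entries = {}   # target register -> (address, size)

    def ld_a(self, memory, reg, addr, size=8):
        """Advanced load: do the load early and remember its address."""
        self.entries[reg] = (addr, size)
        return memory[addr]

    def store(self, memory, addr, value, size=8):
        """A store removes every entry whose address range it overlaps."""
        memory[addr] = value
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if addr + size <= a or a + s <= addr
        }

    def ld_c(self, memory, reg, addr, speculative_value):
        """Check load: keep the early value if the entry survived,
        otherwise reload from memory."""
        if reg in self.entries:
            return speculative_value
        return memory[addr]
```

Usage: after `v = alat.ld_a(mem, "r4", 0x100)`, a store to `0x108` leaves the entry alone and `ld_c` keeps `v`, while a store to `0x100` removes the entry and `ld_c` reloads the new value.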

Procedure Calls

Criticism: Large registers

GR: 32 static + 96 stack
Frames(SPARC), local, output
br.call, brl.call & then br.ret
CFM in PFM (PFS), RRB, alloc (sof, sol)
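
The frame mechanics can be illustrated with a small model. It is a sketch under stated assumptions: 96 stacked registers starting at r32, a frame described by `sof` (size of frame) and `sol` (size of locals), and a Python list standing in for the saved previous frame marker. Spills, fills and register rotation are not modelled, and all names are hypothetical.

```python
class RegisterStack:
    STACKED = 96   # physical stacked registers, r32 upward

    def __init__(self):
        self.phys = [0] * self.STACKED
        self.base = 0      # physical index that r32 currently maps to
        self.sof = 0       # size of frame
        self.sol = 0       # size of locals (inputs + locals)
        self.saved = []    # stands in for the saved frame marker (PFS)

    def alloc(self, sol, sof):
        """Resize the current frame, as alloc does."""
        if not 0 <= sol <= sof <= self.STACKED:
            raise ValueError("invalid frame size")
        self.sol, self.sof = sol, sof

    def _index(self, reg):
        if not 32 <= reg < 32 + self.sof:
            raise IndexError("register outside current frame")
        return (self.base + reg - 32) % self.STACKED

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def call(self):
        """br.call: the caller's outputs become the callee's first registers."""
        self.saved.append((self.base, self.sof, self.sol))
        self.base = (self.base + self.sol) % self.STACKED
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        """br.ret: restore the caller's frame."""
        self.base, self.sof, self.sol = self.saved.pop()
```

With a caller frame of two locals and one output, a value written to r34 before `call()` is read as r32 by the callee, with no copying.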

Procedure Calls
RSE speculatively fills and spills in the background
Result: Vs. PA-RISC 30%, 5% (Database)

Context Switch Instructions

Specific control on stack and backing store
Flushrs to spill previous stack frames
Cover to create a new frame above
Loadrs to fill from backing store

Branch Instruction
Three categories
IP-relative (21 bit) ; Long (60 bit) ; Indirect (in BRs)

Branch Instructions

Software Pipelining
Motivation
Vs HW
Parallelism
3 phases
Rotating FR, PR
LC, EC

Software Pipelining
2 categories
Counted,
While (top, exit)

Counted
Ends with EC=1 and LC=0, no qualifying predicate

While
No LC, ends when QP=0 and EC=1
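
The counted-loop case can be simulated with a short model. This is a sketch with assumptions labelled: a three-stage pipeline (load, add, store), stage predicates p16 to p18, three rotating registers, LC initialised to N-1 and EC to the number of stages. The function name `pipelined_add_one` is hypothetical.

```python
def pipelined_add_one(src):
    """Add 1 to every element using a modelled 3-stage software pipeline.

    Returns (results, kernel_trips).
    """
    n = len(src)
    if n == 0:
        return [], 0

    stages = 3
    lc, ec = n - 1, stages            # loop count, epilog count
    pred = [True, False, False]       # p16, p17, p18 (one per stage)
    rot = [None, None, None]          # rotating registers
    out, next_load, trips = [], 0, 0

    while True:
        trips += 1
        if pred[2]:                   # (p18) store stage
            out.append(rot[2])
        if pred[1]:                   # (p17) add stage
            rot[1] = rot[1] + 1
        if pred[0]:                   # (p16) load stage
            rot[0] = src[next_load]
            next_load += 1

        # Counted loop branch: the kernel phase runs while LC > 0,
        # then the epilog drains the pipeline while EC > 1.
        if lc > 0:
            lc -= 1
            new_pred, taken = True, True
        else:
            new_pred = False
            taken = ec > 1
            if ec > 0:
                ec -= 1

        # Rotation: every predicate and register moves up by one.
        pred = [new_pred] + pred[:-1]
        rot = [None] + rot[:-1]
        if not taken:
            break

    return out, trips
```

For N elements the kernel runs N + 2 times: the first trips fill the pipeline (prolog), the last two drain it (epilog), and the loop exits on the trip where LC = 0 and EC = 1.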

Branch Prediction Hints

Hints, Branch Predict Instructions (brp)
Hints:
strategy

Branch Prediction Hints

Prefetch

Deallocate

Branch Prediction Hints

Branch prediction instructions

LOCATION
TARGET
IMPORTANCE
STRATEGY

Memory Instructions
Simple (GR or FR, memory access order)
Variants for speculative, spilling
Semaphore instructions

Memory Instructions

Integer and Shifting

Add, add1, addp (32bit)
Shift Left Mask Merge: dep, dep.z
Position and field by immediate
Simple shl (amount)

Compare Instructions
Two predicate registers
Deferred token (tnat)
5 types
Normal,
Unconditional
3 parallel compares

Floating Point Architecture

FPSR: precision modes, 4 status fields
All with FMAC= A*B+C: simple,divide
XMA
82 bits: 2 + 32 (if single), 64 (double), 80 (double extended)
Two singles in one register

Compatibility
X86: direct execution
BR.IA, JMPE, overhead of register set saving
SSE included (128), new media
MMX parallel arithmetic: 128 not 8
HP dynamic translator
CMP4

Code Density
Causes
Avg. 43 bit (32 of RISC)
Added (alloc, chk)
Fix-up

Biggest impact
Decreasing hit rate on caches

Observations
Synergetic
ld.sa, data dependences in software pipelining

Compiler

Template
Grouping
Explicit prefetching
ld.a

X86 common SW base (aggressive)

20/30% improvement over RISC is claimed

Instruction Stream
The Processor Front-end

Ganesh Pai

Instruction Stream
Overview of EPIC hardware
I-Stream

Pipeline
I-Cache
Prefetch & Fetch
Branch prediction
Issue (Instruction dispersal & delivery)

Overview of EPIC Hardware

10 Stage In-order Core Pipeline

Pipeline Features
6-wide EPIC hardware under precise compiler control
10-stage in-order pipeline
Dynamic support for run-time optimization
Ensure high throughput

Register scoreboard to enforce dependencies

I Cache ; I TLB

16 KB
4-way set associative
Fully pipelined
64-entry I-TLB
Single cycle
Fully associative
On-chip page walker

I-Cache filters prefetch requests

Both enhanced with an additional port
To check for a miss

Fetch & Prefetch

Speculative fetching
Both hardware and software prefetching
Software initiated instruction prefetch

Triggered by brp hints

Fetch from L2 into instruction-streaming buffer (ISB)
Eight 32-byte entries in the ISB
Short 64-byte bursts / long sequential stream

Eliminate I-fetch bubbles

Fetch & Prefetch

Decoupling buffer
8 bundles deep
Hides stalls, cache misses, branch mispredictions

Branch Prediction
First emphasis on compiler
Reducing branches by predication

Branch Prediction for remaining cases

Assisted by branch hint directives, i.e. branch target addresses
Static hints on branch direction
Indications for use of dynamic predictor

Hierarchy of branch predictors

Branch Prediction

Branch hints + Predictor Hierarchy

Four progressive Resteers
Improved branch prediction

Branch Prediction
Resteer1 : Single Cycle Predictor
4 TARs programmed by compiler with important hints
TAR is a 4 deep FIFO
On a hit branch is predicted taken

Resteer2: Adaptive multi-way return predictor

2 level prediction scheme (Yeh and Patt)

512 (128 x 4) entry branch prediction table (BPT)

2 bit saturating up-down counter to predict direction
Enhanced by 64-entry multi-way BPT
64-entry branch target address cache (BTAC)
8-entry return stack buffer (RSB)
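
The 2-bit saturating up-down counter that predicts direction can be sketched as below. Only a single counter is modelled; the history-indexed table lookup of the full two-level scheme is left out. The class name, the initial state and the helper `count_mispredictions` are assumptions made for illustration.

```python
class TwoBitCounter:
    """States 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at 0 and 3 so one odd outcome cannot flip a strong state.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter=None):
    """Run a sequence of branch outcomes through one counter."""
    if counter is None:
        counter = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses
```

For a loop branch taken nine times and then not taken, run twice, this counter mispredicts three times: once while warming up and once at each loop exit.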

Branch Prediction
Resteer3 & 4
Two branch address calculators (BAC1 and BAC2)
Correction to earlier predictions (if any)
A special perfect-exit-loop-predictor

In case of misses in earlier structures

Use of a static prediction information from bundles

Instruction Dispersal

Instruction Dispersal
Stop bits eliminate dependency checking
Templates simplify routing
Map instructions to first available of 9 issue ports
Keep issuing until stop bit
Resource over-subscription or asymmetry

Re-map virtual register to physical register

Instruction granular
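
The dispersal rules above can be approximated in a few lines. This is a deliberately simplified sketch: it uses the 9 issue ports (2 I, 2 M, 2 F, 3 B) and the 6-wide limit from these slides, splits the stream at stop bits and on port over-subscription, and ignores bundle boundaries and templates. The function name `disperse` and the input format are hypothetical.

```python
PORTS = {"I": 2, "M": 2, "F": 2, "B": 3}   # 9 issue ports in total
MAX_PER_CYCLE = 6                          # 6-wide issue


def disperse(instructions):
    """Group instructions into issue cycles.

    instructions: list of (unit, stop) pairs in program order, where
    unit is one of "I", "M", "F", "B" and stop marks a stop bit after
    the instruction.
    """
    cycles = []
    current = []
    used = dict.fromkeys(PORTS, 0)

    def flush():
        nonlocal current, used
        if current:
            cycles.append(current)
        current = []
        used = dict.fromkeys(PORTS, 0)

    for unit, stop in instructions:
        # Resource over-subscription: no free port of this type,
        # or the issue width is exhausted.
        if len(current) == MAX_PER_CYCLE or used[unit] == PORTS[unit]:
            flush()
        current.append(unit)
        used[unit] += 1
        if stop:           # stop bit ends the instruction group
            flush()

    flush()
    return cycles
```

Because the compiler has already marked the group boundaries with stop bits, the model never compares register names; it only counts ports.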

Instruction Delivery

Data Stream
The Execution Core

Karthik Sankaranarayanan

Recap - Execution Units

17 units + ALAT

4 ALU
4 MMX
2 + 2 FMAC
2 Load/Store
3 branch

Issue Ports

2 I
2 M
2 F
3 B

Register Files
Integer
128 64-bit
8 read ports (2 x 2 I units, 2 x 2 M units)
6 write ports (1 x 2 I units, 2 x 2 Loads - A.I)

Floating Point
128 82-bit (double extended)
8 read ports (2 x 2 F units, 2 x 2 M units)
4 write ports (2 x 2 F units, 2 x 2 M units)

Predicate
64 1-bit , broadside R/W
15 read ports (2 x 6 - M, F, I units & 3B units)
11 write ports (2 x 2 M units, 2 x 2 I units, 2 x 1 F unit, 1 x 1 Reg. Rot.)

Recap - 10 Stage Pipeline

Operand Delivery - WLD/REG Stages

Register Scoreboard
Hazard detection
Stall only dependent instructions
Include predicates
cmp.eq r1,r2 --> p1,p3
(p1) ld4[r3] --> r4
add r4, r1 --> r5 (no dependence if p1=0)

Defer stalls

Operand Delivery
Deferred Stall

Stall actually in EXE stage

Clock frequency
Operand read over - can't re-read
Snoop the register bypass network

OLM - Operand Latch Manipulation

Execution
Deferred Stall
Execute
Writes turned off at retirement for false predicates
Different latencies - Out Of Order Execution
In-order retire - scoreboard
cmp.eq r1,r2 --> p1,p3
cmp.eq r7,r8 --> p5,p7
(p1) ld4[r3] --> r4 (reads p1 in EXE)
(p5) add r4, r1 --> r5 (reads p5 in REG)

Predicates
Producer reads in EXE
Consumer reads in REG

Execution
Predicates
Forward as soon as possible
Minimize forwarding logic
Predicate generation - deterministic latency

Separate Register file

Speculative, Architectural (SPRF, APRF)
Shadow state
Bypass paths to eliminate false stalls

DET/ WRB - Parallel Branches

Multi-way branches - speculation + predication

B units - up to 3 branches parallel execution
Execution in DET stage
Can use predicates in the same bundle

Software pipeline support - LC, EC

DET/ WRB - Parallel Branches

Control Speculation

ld.s, chk.s
Exception Deferral - NaTs, NaTVals (poison bits!)
Store NaTs? - store.spill, ld.fill (context switch)
UNaT, RNaT

Data Speculation
ld.a, chk.a, ld.c
ld.c can be issued with dependent instructions
ALAT - 32 entries, Register ID, Address, Size

In-order retirement (branch misprediction/flush).

FPU Details

Pipelined FMACs (A*B + C) (5 cycles)

4 DP ops/ 8 SP (SIMD) ops per cycle
Divide/ Square root - S/W pipeline
FP CMP operations (2 cycles)
direct L2 cache contact - 2 ldf pair / cycle
setf, getf, XMA, status registers

Memory Subsystem

Address translation
32 entry L1 DTLB, 96 entry L2 DTLB, Page size 4K - 256 M
Regions for sharing, Keys for protection
Hardware page walker

Memory Subsystem
L1 Data
16 K, 4-way, 32 byte lines
write through, no write allocate
dual ported, 2 cycle load latency

L2, on chip, unified

96 K, 6 way, 64 byte lines, Write back, write allocate
Dual ported, 6 cycles Int, 9 cycles FP load latencies
MESI protocol for coherence

L3, off chip, on package, unified

4 M, 4-way, 64 byte lines
21-24 cycle latency, 128 bit bus
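
The cache parameters above determine the number of sets and how an address is split. A short worked sketch, using only the sizes, associativities and line sizes listed on this slide; the function names and the sample address are illustrative.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)


def split_address(addr, sets, line_bytes):
    """Decompose an address into (tag, set index, byte offset)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


KB = 1024
MB = 1024 * KB

l1d_sets = cache_sets(16 * KB, 4, 32)   # L1 data: 16 K, 4-way, 32-byte lines
l2_sets = cache_sets(96 * KB, 6, 64)    # L2: 96 K, 6-way, 64-byte lines
l3_sets = cache_sets(4 * MB, 4, 64)     # L3: 4 M, 4-way, 64-byte lines
```

The six ways of the L2 are what make its 96 K capacity work out to a power-of-two number of sets (256), so the index is still a plain bit field.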

Memory Subsystem
Caches
Hints
FP NT1 = Int NT2
Bias - Easier MESI

Rest of the Processor

System Bus

64 bit, 2.1 GB/s
Multidrop , Split transaction bus
Up to 56 outstanding transactions
Optimized MESI protocol
Glue-less multiprocessor support (Up to 4)

IA 32 control
ECC/Parity coverage of processor and bus
Read only structures - parity
Data - ECC.

Putting It All Together

The Block Diagram

Conclusions
To Sum Up

Conclusions

Complexity shift to compilers

Methods to express compile time information
Large register files, EPIC specific Hardware
Optimized FPUs for multimedia applications
Large L3 cache
Reliability and performance - server side

Neat design, Let us see if it succeeds
