Computer Science 146 Computer Architecture
Spring 2004
Harvard University
Instructor: Prof. David Brooks
Lecture 12: Hardware-Assisted Software ILP and IA64/Itanium Case Study
Computer Science 146
David Brooks
Lecture Outline
Review of Global Scheduling, Hardware-Assisted Software ILP
IA-64 Instruction Set Architecture
Itanium, Itanium-2 Processor
Trace Scheduling
Parallelism across IF branches vs. LOOP branches?
Two steps:
Trace Selection
Find a likely sequence of basic blocks (a trace), statically predicted or profile predicted, that forms a long sequence of straight-line code
Trace Compaction
Squeeze trace into few VLIW instructions
Need bookkeeping code in case prediction is wrong
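The trace selection step can be sketched in a few lines of Python. This is a minimal illustration, not a production scheduler: the control-flow graph format (a dict mapping each block to its successors and their profiled probabilities), the function name `select_trace`, and the `min_prob` cutoff are all assumptions made for this example.

```python
def select_trace(cfg, seed, min_prob=0.5):
    """Greedily follow the most likely successor edge, starting at `seed`.

    `cfg` maps a block name to {successor: probability}. The format and the
    `min_prob` cutoff are hypothetical choices made for this sketch.
    """
    trace = [seed]
    seen = {seed}
    block = seed
    while cfg.get(block):
        successor, prob = max(cfg[block].items(), key=lambda kv: kv[1])
        # Stop at unlikely edges and at blocks already in the trace (loops).
        if prob < min_prob or successor in seen:
            break
        trace.append(successor)
        seen.add(successor)
        block = successor
    return trace


# A diamond: B1 branches to B2 (likely) or B3 (unlikely); both rejoin at B4.
cfg = {
    "B1": {"B2": 0.9, "B3": 0.1},
    "B2": {"B4": 1.0},
    "B3": {"B4": 1.0},
    "B4": {},
}
trace = select_trace(cfg, "B1")
off_trace = [block for block in cfg if block not in trace]
```

Here the selected trace is B1, B2, B4. Block B3 stays off the trace, and it is on paths like this one that bookkeeping code is needed when the prediction is wrong.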
Predicated Loads
Use a predicated version of load word (LWC)?
The load occurs unless the third operand is 0
LW   R1,40(R2)
BEQZ R10,L
LW   R8,0(R10)
LW   R9,0(R8)
Full Predication
Full Predication works better for long streams of code
Set-Predicate Instructions, e.g. seqzp
Instructions converted to predicated versions
If predicate is true perform op, otherwise ignore it
Normal Code
BEQZ R2, L
ADD R4, R6, R5
JUMP L2
L: ADD R4,R5,R6
L2:
Predicated Code
SEQZP P1,R2
ADD.np R4,R6,R5,P1
ADD.p R4,R5,R6,P1
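To make the "perform op, otherwise ignore it" rule concrete, here is a small Python sketch of a predicated interpreter running the three-instruction sequence above. The tuple encoding `(op, dst, src1, src2, guard)` and the function name `run` are assumptions made for illustration; only SEQZP, ADD.p and ADD.np come from the slide.

```python
def run(program, regs):
    """Execute (op, dst, src1, src2, guard) tuples on a copy of `regs`.

    `guard` is None for unpredicated instructions, or (predicate, wanted),
    meaning the result is kept only if the predicate equals `wanted`.
    Returns the final registers and the indices of instructions that took effect.
    """
    regs = dict(regs)
    preds = {}
    committed = []
    for index, (op, dst, src1, src2, guard) in enumerate(program):
        if guard is not None:
            pred_name, wanted = guard
            if preds[pred_name] != wanted:
                continue  # predicate mismatch: the instruction is ignored
        if op == "SEQZP":
            preds[dst] = regs[src1] == 0  # set predicate if source is zero
        elif op == "ADD":
            regs[dst] = regs[src1] + regs[src2]
        else:
            raise ValueError(f"unknown op {op}")
        committed.append(index)
    return regs, committed


PREDICATED = [
    ("SEQZP", "P1", "R2", None, None),
    ("ADD", "R4", "R6", "R5", ("P1", False)),  # ADD.np R4,R6,R5,P1
    ("ADD", "R4", "R5", "R6", ("P1", True)),   # ADD.p  R4,R5,R6,P1
]

_, taken_when_zero = run(PREDICATED, {"R2": 0, "R4": 0, "R5": 3, "R6": 4})
_, taken_when_nonzero = run(PREDICATED, {"R2": 9, "R4": 0, "R5": 3, "R6": 4})
```

With R2 equal to 0 only the ADD.p takes effect, matching the branch to L in the normal code. With a nonzero R2 only the ADD.np takes effect. No branch is executed in either case.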
Must be able to:
Ignore exceptions in speculated instructions until they are non-speculative
Speculatively interchange loads/stores and stores/stores that may have address conflicts
LD   R1,0(R3)         ;load A
BNEZ R1,L1            ;test A
LD   R1,0(R2)         ;then clause
J    L2               ;skip else
L1: ADDI R1,R1,#4     ;else clause
L2: SD   R1,0(R3)     ;store A
LD   R1,0(R3)         ;load A
LD   R14,0(R2)        ;speculative load B
BEQZ R1,L3            ;other branch of the if
ADDI R14,R1,#4        ;else clause
L3: SD R14,0(R3)      ;store A
Special Instructions
Speculative loads (sLD) and Speculative Checks (SPECCK)
sLD will not generate exceptions
SPECCK will generate exceptions
LD     R1,0(R3)       ;load A
sLD    R14,0(R2)      ;speculative, no exceptions
BNEZ   R1,L1          ;test A
SPECCK 0(R2)          ;perform speculation check
J      L2             ;skip else
L1: ADDI R14,R1,#4    ;else clause
L2: SD   R14,0(R3)    ;store A
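The sLD/SPECCK sequence can be modeled in Python to show when the exception actually surfaces. This is a sketch under stated assumptions: memory is a dict, a missing address stands in for a faulting access, and the `Machine` class, its method names, and `update_a` are hypothetical names invented for this example.

```python
class MemoryFault(Exception):
    """Stands in for a terminating exception such as a page fault."""


class Machine:
    def __init__(self, memory):
        self.memory = memory
        self.deferred = set()  # addresses whose speculative load faulted

    def ld(self, addr):
        # Ordinary load: faults immediately.
        if addr not in self.memory:
            raise MemoryFault(addr)
        return self.memory[addr]

    def sld(self, addr):
        # Speculative load: never raises, only records the fault.
        if addr not in self.memory:
            self.deferred.add(addr)
            return 0
        return self.memory[addr]

    def specck(self, addr):
        # Speculation check: raises the exception the sLD suppressed.
        if addr in self.deferred:
            raise MemoryFault(addr)


def update_a(m, addr_a, addr_b):
    """if (A == 0) A = B; else A = A + 4, with B loaded speculatively."""
    r1 = m.ld(addr_a)          # LD     R1,0(R3)
    r14 = m.sld(addr_b)        # sLD    R14,0(R2)
    if r1 == 0:                # BNEZ   R1,L1 falls through
        m.specck(addr_b)       # SPECCK 0(R2)
    else:
        r14 = r1 + 4           # L1: ADDI R14,R1,#4
    m.memory[addr_a] = r14     # L2: SD   R14,0(R3)
    return r14
```

If A is nonzero, a faulting load of B is harmless because the check is never reached. If A is zero, the check raises the fault on the path where B is really needed.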
Poison Bits
Track exceptions as they occur
Postpone terminating exceptions until a value is used
Poison bits are added to every register
Set when speculative instruction causes a terminating fault
LD   R1,0(R3)         ;load A
sLD  R14,0(R2)        ;speculative load B
BEQZ R1,L3            ;
ADDI R14,R1,#4        ;
L3: SD R14,0(R3)      ;exception for speculative LW
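The poison-bit scheme can be sketched as a value plus a one-bit flag per register. The Python below is an illustrative model only: the `Reg` dataclass, the helper names, and the use of a missing dict key as the "terminating fault" are assumptions made for this example.

```python
from dataclasses import dataclass


class TerminatingFault(Exception):
    pass


@dataclass
class Reg:
    value: int = 0
    poison: bool = False


def ld(memory, addr):
    # Non-speculative load: faults right away.
    if addr not in memory:
        raise TerminatingFault(addr)
    return Reg(memory[addr])


def sld(memory, addr):
    # Speculative load: on a fault, set the poison bit instead of raising.
    if addr not in memory:
        return Reg(0, poison=True)
    return Reg(memory[addr])


def addi(src, imm):
    # The destination inherits the poison bit of its source.
    return Reg(src.value + imm, src.poison)


def sd(memory, addr, src):
    # A non-speculative use of a poisoned register raises the postponed fault.
    if src.poison:
        raise TerminatingFault(addr)
    memory[addr] = src.value


def update_a(memory, addr_a, addr_b):
    r1 = ld(memory, addr_a)      # LD   R1,0(R3)
    r14 = sld(memory, addr_b)    # sLD  R14,0(R2)
    if r1.value != 0:            # BEQZ R1,L3 not taken
        r14 = addi(r1, 4)        # ADDI R14,R1,#4 overwrites the poisoned value
    sd(memory, addr_a, r14)      # L3: SD R14,0(R3)
```

When A is nonzero, the ADDI overwrites R14 with a clean value, so a fault on the speculative load is never observed. When A is zero, the store uses the poisoned R14 and the exception is raised there.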
Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
HW advantages:
SW advantages:
IA-64 Registers
The integer registers are designed to assist procedure calls using a register stack
Similar to SPARC's register windows
Registers 0-31 are always accessible and addressed as 0-31
Registers 32-127 are used as a register stack, and each procedure is allocated a set of registers (from 0 to 96)
A new register stack frame is created for a called procedure by renaming the registers in hardware
A special register called the current frame marker (CFM) points to the set of registers to be used by a given procedure
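The renaming behind the register stack can be sketched as a base offset applied to every stacked register number. This Python model is a simplification with assumed names (`physical`, `call`, `base`): it treats the frame base as a single integer, places the callee's frame right after the caller's local registers, and does not model spilling to memory when the 96 stacked registers run out.

```python
NUM_STATIC = 32    # registers 0-31: always accessible, never renamed
NUM_STACKED = 96   # registers 32-127: the register stack


def physical(reg, base):
    """Map an architectural register number to a physical one.

    `base` is the current frame's offset into the stacked registers
    (an assumption standing in for the frame state kept in hardware).
    """
    if reg < NUM_STATIC:
        return reg
    return NUM_STATIC + (base + reg - NUM_STATIC) % NUM_STACKED


def call(base, caller_locals):
    """Return the callee's frame base: just past the caller's local registers."""
    return (base + caller_locals) % NUM_STACKED


caller_base = 0
callee_base = call(caller_base, caller_locals=8)

# The caller's register 40 and the callee's register 32 are the same
# physical register, so arguments are passed without copying.
shared = physical(40, caller_base) == physical(32, callee_base)
```

The point of the sketch is that a call changes only the base, so the callee sees a fresh set of registers starting at 32 without any values being moved.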
IA-64 Registers
Both the integer and floating point registers support register rotation for registers 32-127
Register rotation eases the task of allocating registers in software pipelined loops
Avoids the need for unrolling and for prologue and epilogue code for a software pipelined loop
Makes software pipelining usable for loops with smaller numbers of iterations
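A rough Python sketch shows how rotation hands a value from one pipeline stage to the next without unrolling. Everything here is an illustrative assumption: the `RotatingFile` class, the 8-register rotating region, and the two-stage loop that doubles each element. The `if` guards play the role that stage predicates would play in a real software pipelined loop.

```python
class RotatingFile:
    """A small rotating region starting at register 32 (size is an assumption)."""

    def __init__(self, size=8, first=32):
        self.first = first
        self.size = size
        self.phys = [0] * size
        self.rrb = 0  # rotating register base

    def _index(self, reg):
        return (reg - self.first + self.rrb) % self.size

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def rotate(self):
        # After rotation, the value written as register r is read as r + 1.
        self.rrb = (self.rrb - 1) % self.size


def pipelined_double(xs):
    """Two-stage pipeline: stage 1 loads x[i], stage 2 emits 2 * x[i - 1]."""
    rf = RotatingFile()
    out = []
    for i in range(len(xs) + 1):      # one extra iteration drains the pipeline
        if i >= 1:                    # stage 2 guard
            out.append(rf.read(33) * 2)
        if i < len(xs):               # stage 1 guard
            rf.write(32, xs[i])
        rf.rotate()                   # done by the loop branch in this model
    return out


result = pipelined_double([1, 2, 3])
```

The loop body always names registers 32 and 33, yet each iteration works on fresh physical registers, which is why no unrolled copies of the body are needed.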
IA-64 instructions are encoded in bundles, which are 128 bits wide
Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits in length
The template field determines whether the 3 instructions in a bundle are dependent or independent
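The bundle arithmetic (5 + 3 x 41 = 128 bits) can be checked with a short Python sketch that packs and unpacks a bundle as one integer. The function names are hypothetical, and the bit placement used here (template in the low 5 bits, then slots 0, 1, 2 in ascending order) is an assumption for this illustration; consult the architecture manual for the authoritative layout.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("need exactly three 41-bit slots")
    word = template
    for i, slot in enumerate(slots):
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots


bundle = pack_bundle(0x10, [0x123, 0x456, 0x789])
template, slots = unpack_bundle(bundle)
```

Packing the largest template and slot values fills exactly 128 bits, which confirms that the field widths on the slide add up.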
Execution unit slot   Instruction type   Description      Example Instructions
I-unit                A                  Integer ALU      add, subtract, and, or, cmp
I-unit                I                  Non-ALU Int      shifts, bit tests, moves
M-unit                A                  Integer ALU      add, subtract, and, or, cmp
M-unit                M                  Mem access       Loads, stores for int/FP regs
F-unit                F                  Floating point   Floating point instructions
B-unit                B                  Branches         Conditional branches, calls
L+X                   L+X                Extended         Extended immediates, stops
Template Examples
[Table: columns Template, Slot 0, Slot 1, Slot 2; rows for templates 28 and 29, with stop bits marked]
Predication Support
Nearly all instructions are predicated
Conditional branches are predicated jumps!
Speculation Support
All INT registers have a 1-bit NaT (Not a Thing)
This is a poison bit (as discussed earlier)
Speculative loads generate these
All other instructions propagate them
Deferred exceptions
If a nonspeculative instruction receives a NaT as a source operand, there is an unrecoverable exception
chk.s instructions can detect a NaT and branch to recovery code
[Figure: processor block diagram. Labels: FPU; IA-64 Control; Integer Units; Instr. Fetch & Decode; Cache; TLB; Cache; Bus; 4 x 1MB L3 cache]
800 MHz
Transistor Count
Process
Package
Machine Width
Registers
Speculation
Branch Prediction
FP Compute Bandwidth: 4 DP (8 SP) operands/clock
L2/L1 Cache
L2/L1 Latency: 6 / 2 clocks
L3 Cache
System Bus
[Figure: pipeline diagram with stages IPG (instruction pointer generation), ROTATE, EXPAND, REN (rename), WL.D (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)]
Pre-fetch/Fetch of up to 6 instructions/cycle
Hierarchy of branch predictors
Decoupling buffer
Instruction Delivery
Dispersal of up to 6 instructions on 9 ports
Reg. remapping
Reg. stack engine
Operand Delivery
Reg read + Bypasses
Register scoreboard
Predicated dependencies
Execution
Instruction Dispersal
Predicate Delivery
Predicates generated in EXE are delivered in DET and feed into retirement, branch execution, and dependency detection
All instructions read operands and execute
Instructions with a false predicate are canceled at retirement
[Figure: bar chart, y-axis 0 to 2000, comparing processors (on-chip L2/L3 cache size in parentheses): Itanium 800MHz (96KB); Itanium-2 1GHz (3MB); Itanium-2 1.5GHz (6MB); Pentium 4 3.4GHz (.5MB); Pentium 4 3.4GHz (2.5MB); IBM POWER4 1.7GHz (1.5MB); AMD Opteron 2.2GHz (1MB)]