Computer Science 246 Computer Architecture: Spring 2009 Harvard University
Computer Architecture
Spring 2009
Harvard University
Loads
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
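The AMAT figures above can be related to the slide's cache parameters with the standard per-access miss-cost arithmetic. A minimal sketch; the 4.25% miss rate below is an assumption chosen to illustrate the 0.68-cycle figure, not a number from the slide:

```python
MISS_PENALTY = 16  # cycles, from the cache configuration on the slide

def miss_contribution(miss_rate):
    """Average cycles lost to misses per access: miss_rate * miss_penalty."""
    return miss_rate * MISS_PENALTY

# An (assumed) 4.25% miss rate would account for the 0.68-cycle figure:
print(miss_contribution(0.0425))  # -> 0.68
```

Each successive arrow on the slide corresponds to an optimization lowering the miss rate, and hence this per-access contribution.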
Reducing Misses by Hardware Prefetching of Instructions & Data
• Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer” not the cache
– On Access: check both cache and stream buffer
– On SB Hit: move line into cache
– On SB Miss: Clear and refill SB with successive lines
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can
be used without penalty
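The stream-buffer protocol above (check cache and buffer in parallel; on a buffer hit, move the line into the cache; on a buffer miss, clear and refill with successive lines) can be sketched as a toy model. Block addresses are integers and the buffer depth and the `set`-based stand-in for the cache are illustrative assumptions, not parameters from the slide:

```python
from collections import deque

class StreamBuffer:
    """Toy model of the Alpha-21064-style stream buffer described above."""
    def __init__(self, depth=4):
        self.depth = depth
        self.buf = deque()      # prefetched successor block addresses
        self.cache = set()      # stand-in for the real cache

    def access(self, block):
        if block in self.cache:        # check cache ...
            return "cache hit"
        if block in self.buf:          # ... and stream buffer in parallel
            self.cache.add(block)      # SB hit: move line into cache
            self.buf.remove(block)
            return "stream-buffer hit"
        # SB miss: clear and refill SB with successive lines
        self.buf = deque(block + i for i in range(1, self.depth + 1))
        self.cache.add(block)          # demand miss also fills the cache
        return "miss"

sb = StreamBuffer()
print(sb.access(100))  # miss; SB now holds blocks 101..104
print(sb.access(101))  # stream-buffer hit
```

The "extra memory bandwidth" caveat on the slide shows up here as the refill traffic for blocks the program may never touch.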
Computer Science 246
David Brooks
Hardware Prefetching
• What to prefetch?
– One block ahead (spatially)
• What will this work well for?
– Address prediction for non-sequential data
• Correlated predictors (store miss, next_miss pairs in table)
• Jump-pointers (augment data structures with prefetch pointers)
• When to prefetch?
– On every reference
– On a miss (basically doubles block size!)
– When resident data becomes “dead” -- how do we know?
• No one will use it anymore, so it can be kicked out
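The correlated-predictor bullet above can be sketched concretely: store (miss, next_miss) pairs in a table, and on each miss look up which address tended to miss next. The table structure and names here are illustrative assumptions:

```python
class CorrelatedPrefetcher:
    """Sketch of a correlated miss predictor: learn (miss -> next miss)
    pairs and return the predicted successor to prefetch."""
    def __init__(self):
        self.table = {}        # miss address -> next observed miss address
        self.last_miss = None

    def on_miss(self, addr):
        if self.last_miss is not None:
            self.table[self.last_miss] = addr  # learn the pair
        self.last_miss = addr
        return self.table.get(addr)            # prediction, if any

p = CorrelatedPrefetcher()
p.on_miss(10)   # no prediction yet
p.on_miss(50)   # learns 10 -> 50
p.on_miss(10)   # -> 50: prefetch block 50
```

Unlike one-block-ahead prefetching, this captures repeating non-sequential miss patterns (e.g. pointer chasing), at the cost of a table.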
Reducing Misses by Software Prefetching Data
• Data Prefetch
– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
– Special prefetching instructions cannot cause faults; a form of speculative
execution
• Prefetching comes in two flavors:
– Binding prefetch: requests load directly into register.
• Must be correct address and register!
– Non-Binding prefetch: Load into cache.
• Can be incorrect. Faults?
• Issuing Prefetch Instructions takes time
– Is cost of prefetch issues < savings in reduced misses?
– Higher superscalar width reduces difficulty of issue bandwidth
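The cost test on this slide (is the cost of issuing prefetches less than the savings from reduced misses?) is simple cycle arithmetic. A sketch with illustrative numbers; the function name and all inputs are assumptions:

```python
def prefetch_pays_off(num_prefetches, issue_cost, misses_avoided, miss_penalty):
    """The slide's cost test: total issue overhead vs. cycles saved.
    All quantities are in cycles; values are illustrative."""
    return num_prefetches * issue_cost < misses_avoided * miss_penalty

# 100 one-cycle prefetch issues that avoid 20 misses at 16 cycles each:
print(prefetch_pays_off(100, 1, 20, 16))  # -> True (100 < 320)
```

This is also where superscalar width enters: on a wide machine the prefetch instructions can often issue in otherwise-idle slots, driving the effective issue cost toward zero.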
Reducing Hit Times
• Some common techniques/trends
– Small and simple caches
• Pentium III – 16KB L1
• Pentium 4 – 8KB L1
– Pipelined Caches (actually bandwidth increase)
• Pentium – 1 clock cycle I-Cache
• Pentium III – 2 clock cycle I-Cache
• Pentium 4 – 4 clock cycle I-Cache
– Trace Caches
• Beyond spatial locality
• Dynamic sequences of instructions (including taken branches)
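The trace-cache idea above can be sketched as a lookup structure keyed by a starting PC plus the branch outcomes along the path, so a single entry can span taken branches. This is a toy model under assumed names, not the Pentium 4's actual organization:

```python
class TraceCache:
    """Toy trace cache: store dynamic instruction sequences keyed by
    (start PC, branch-outcome path), so fetch can cross taken branches."""
    def __init__(self):
        self.traces = {}

    def fill(self, start_pc, branch_outcomes, instrs):
        """Record a dynamic sequence observed at retirement."""
        self.traces[(start_pc, tuple(branch_outcomes))] = instrs

    def fetch(self, start_pc, predicted_outcomes):
        """Return the whole trace if the predicted path matches, else None."""
        return self.traces.get((start_pc, tuple(predicted_outcomes)))

tc = TraceCache()
tc.fill(0x400, [True, False], ["ld", "add", "br", "sub", "br", "st"])
tc.fetch(0x400, [True, False])   # whole sequence in one fetch
tc.fetch(0x400, [False, False])  # path mismatch: fall back to I-cache
```

The contrast with an ordinary I-cache is that hits here follow the program's dynamic path rather than spatial locality within a static cache block.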