Advanced Cache Optimizations: Adapted from Patterson and Hennessy (Morgan Kaufmann Publishers)
Overview
[Figure: Processor-memory performance gap. Performance (log scale, 1 to 100,000) versus year (1980 to 2010) for processor and memory; the gap between the two is growing.]
[Figure: Access time (ns) versus cache size, 16 KB to 1 MB.]
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
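
The interchange above helps because C stores two-dimensional arrays in row-major order, so making j the inner loop walks adjacent memory instead of striding 100 elements at a time. The following is a minimal sketch of the two traversal orders as standalone functions. The function names, the int element type, and the ROWS and COLS macros are assumptions made for illustration, and the outer k loop is omitted so that each function applies the doubling exactly once.

```c
#include <stddef.h>

#define ROWS 5000
#define COLS 100

/* "Before" order: the inner loop runs over i, so consecutive accesses
 * are COLS elements apart in memory (poor spatial locality). */
void scale_column_order(int x[ROWS][COLS])
{
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* "After" order: the inner loop runs over j, so consecutive accesses
 * touch adjacent elements of the same row (good spatial locality). */
void scale_row_order(int x[ROWS][COLS])
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both functions compute the same result; only the order of memory accesses differs, which is why a compiler can apply the interchange without changing program meaning.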
• Data Prefetch
– Register prefetch: load data into a register (HP PA-RISC loads)
– Cache prefetch: load data into the cache only (MIPS IV, PowerPC, SPARC V9)
– Special prefetching instructions cannot cause faults; they are a form of speculative execution
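
As a hedged illustration of cache prefetching from software, the sketch below uses the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`, which compiles to a prefetch instruction where the target has one. The function name and the PREFETCH_DISTANCE value are assumptions chosen for illustration, not values from the slides; a useful distance depends on the machine's miss latency and the work done per iteration.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while requesting future elements ahead of use. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* rw = 0 requests a read; locality = 3 asks to keep the line
         * in all cache levels. The bounds check keeps the address
         * computation inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
#endif
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint: it does not change the result, and on compilers without the builtin the function falls back to a plain summation loop.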
[Figure: Mflop/s by block size, plotted against column block size (c = 1, 2, 4, 8), with reference Mflop/s marked. Best: 4x2.]
• Across 8 computers, every possible column block size was selected as best on some machine; how could a compiler know which to choose?
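
One answer to the question above is to search empirically: run the kernel with each candidate block size on the target machine and keep the fastest. The sketch below illustrates that idea with a blocked matrix transpose as a stand-in kernel, since the kernel behind the slide's figure is not shown here. The function names and the single-run timing with `clock()` are assumptions for illustration; a real search would repeat each measurement and use larger inputs.

```c
#include <stddef.h>
#include <time.h>

/* Transpose an n x n row-major matrix in b x b tiles (b must be >= 1). */
void transpose_blocked(const double *src, double *dst, size_t n, size_t b)
{
    for (size_t ii = 0; ii < n; ii += b)
        for (size_t jj = 0; jj < n; jj += b)
            for (size_t i = ii; i < ii + b && i < n; i++)
                for (size_t j = jj; j < jj + b && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}

/* Time each candidate block size once and return the fastest one.
 * Requires count >= 1 and every candidate >= 1. */
size_t pick_block_size(const double *src, double *dst, size_t n,
                       const size_t *candidates, size_t count)
{
    size_t best = candidates[0];
    double best_time = -1.0;
    for (size_t k = 0; k < count; k++) {
        clock_t start = clock();
        transpose_blocked(src, dst, n, candidates[k]);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = candidates[k];
        }
    }
    return best;
}
```

The chosen size can differ from one machine to the next, which is consistent with the slide's point that a static compiler decision is hard to get right.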
Summary of techniques (+ = improves that factor, blank = no impact; hardware cost/complexity from 0 = easiest to 3 = hardest)

Technique                                      Hit time  Bandwidth  Miss penalty  Miss rate  HW cost/complexity  Comment
Way-predicting caches                          +                                             1                   Used in Pentium 4
Trace caches                                   +                                             3                   Used in Pentium 4
Nonblocking caches                                       +          +                        3                   Widely used
Banked caches                                            +                                   1                   Used in L2 of Opteron and Niagara
Critical word first and early restart                               +                        2                   Widely used
Merging write buffer                                                +                        1                   Widely used with write through
Compiler techniques to reduce cache misses                                        +          0                   Software is a challenge; some computers have compiler option
Hardware prefetching of instructions and data                       +             +          2 instr., 3 data    Many prefetch instructions; AMD Opteron prefetches data
Compiler-controlled prefetching                                     +             +          3                   Needs nonblocking cache; in many CPUs