Question: Who Cares About The Memory Hierarchy?: Caches and Memory Systems I
[Figure: CPU vs. DRAM performance, 1980–2000, log scale (1 to 1000). µProc performance grows 60%/yr. (“Moore’s Law”); DRAM performance grows 7%/yr. (“Less’ Law?”). The processor-memory (CPU-DRAM) performance gap grows 50% / year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
  (1989: first Intel µproc with a cache on chip)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?

[Figure: cache organization — each entry holds a Valid Bit, Cache Tag, and Cache Data block; e.g. tag 0x50 at index 1 selects a 32-byte block (Byte 32 … Byte 63); 32 entries (index 0–31) hold Byte 0 … Byte 1023.]

[Figure: memory hierarchy — Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.; lower levels are Bigger, upper levels are Faster.]
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operating in parallel
• Example: Two-way set associative cache
  – Cache Index selects a “set” from the cache
  – The two tags in the set are compared to the input in parallel
  – Data is selected based on the tag result

[Figure: two-way set associative cache — two ways of Valid / Cache Tag / Cache Data arrays share one Cache Index; the two tag-compare results are ORed into Hit, and a MUX selects the Cache Block.]

Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue; recover later if miss.
3. Reduce the time to hit in the cache.

– Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
  (Misses in a Fully Associative, Size X Cache)
– Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
  (Misses in an N-way Associative, Size X Cache)
2:1 Cache Rule
• miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate (0–0.14) vs. Cache Size (1–128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; each curve decomposes into Conflict, Capacity, and Compulsory components, with the Compulsory component vanishingly small.]
3) Change Compiler: Which of 3Cs is obviously affected?

[Figure: Miss Rate (0%–15%) vs. block size (16–128 bytes) for 4K, 16K, 64K, and 256K caches.]

• Beware: Execution time is the only final measure!
  – Will Clock Cycle time increase?
  – Hill [1988] suggested hit time for 2-way vs. 1-way:
    external cache +10%, internal +2%
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];}

• Two misses per access to a & c vs. one miss per access; fusing the loops improves temporal locality (a[i][j] and c[i][j] are reused while still in the cache).

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    };

• Two Inner Loops:
  – Read all N×N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity Misses are a function of N & Cache Size:
  – 2N³ + N² words accessed => (assuming no conflict; otherwise …)
• Idea: compute on a B×B submatrix that fits in the cache
Blocking Example (continued)

/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
    for (j = jj; j < min(jj+B,N); j = j+1)
    {   r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
    };

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Figure: bar chart over benchmarks vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), compress.]
[Figure: processor vs. DRAM (or lower mem) performance, 1981–2000, log scale (1–100); annotations: “= ƒ(no. operations)”, “1990: Pipelined Execution & Fast Clock Rate”, “Out-of-Order execution”; the widening gap means a growing miss penalty.]

Branch Latency (conditions evaluated during EX phase)
[Figure: pipeline diagram — successive instructions flow through stages IF IS RF EX DF DS TC WB; a branch resolves in EX, giving a THREE cycle penalty: the delay slot plus two stalls. Branch likely cancels the delay slot if not taken.]

Cache Optimization Summary

Technique                            MR   MP   HT   Complexity
miss rate
  Larger Block Size                  +    –         0
  Higher Associativity               +         –    1
  Victim Caches                      +              2
  Pseudo-Associative Caches          +              2
  HW Prefetching of Instr/Data       +              2
  Compiler Controlled Prefetching    +              3
  Compiler Reduce Misses             +              0
miss penalty
  Priority to Read Misses                 +         1
  Early Restart & Critical Word 1st       +         2
  Non-Blocking Caches                     +         3
  Second Level Caches                     +         2
  Better memory system                    +         3
hit time
  Small & Simple Caches              –         +    0
  Avoiding Address Translation                 +    2
  Pipelining Caches                            +    2
Exercise 4