08 Caches
[Figure: plot of execution Time versus working-set SIZE for a program running on a real computer system, showing the actual measured data.]
Problem: Processor-Memory Bottleneck
• Processor performance doubled about every 18 months; bus bandwidth evolved much slower.
[Diagram: CPU (with registers) connected over the bus directly to main memory.]
• Solution: caches
[Diagram: CPU (with registers), then a cache, then main memory.]
(Here a "cycle" is a single fixed-time machine step.)
Cache
• English definition: a hidden storage space for provisions, weapons, and/or treasures
• More generally: a small, fast storage area used to optimize data transfers between system elements with different characteristics
[Diagram: general cache concept — the cache holds blocks 8, 9, 14, and 3. Data in block b is needed and block b (e.g., block 14) is in the cache: Hit! Memory holds all blocks, laid out as
  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15]
Locality
• Temporal locality: recently referenced items are likely to be referenced again in the near future.
• Spatial locality: items with nearby addresses tend to be referenced close together in time.
Example: Locality?
int sum_array(int a[], int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
• Data:
  – Temporal: sum referenced in each iteration
  – Spatial: array a[] accessed in stride-1 pattern
• Instructions:
  – Temporal: cycle through the loop repeatedly
  – Spatial: reference instructions in sequence
Locality Example #1

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Layout in memory (row-major, shown for M = 3, N = 4):
  a[0][0] a[0][1] a[0][2] a[0][3]
  a[1][0] a[1][1] a[1][2] a[1][3]
  a[2][0] a[2][1] a[2][2] a[2][3]

Access order:
   1: a[0][0]   2: a[0][1]   3: a[0][2]   4: a[0][3]
   5: a[1][0]   6: a[1][1]   7: a[1][2]   8: a[1][3]
   9: a[2][0]  10: a[2][1]  11: a[2][2]  12: a[2][3]
stride-1
Locality Example #2

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Layout in memory (row-major, shown for M = 3, N = 4):
  a[0][0] a[0][1] a[0][2] a[0][3]
  a[1][0] a[1][1] a[1][2] a[1][3]
  a[2][0] a[2][1] a[2][2] a[2][3]

Access order:
   1: a[0][0]   2: a[1][0]   3: a[2][0]
   4: a[0][1]   5: a[1][1]   6: a[2][1]
   7: a[0][2]   8: a[1][2]   9: a[2][2]
  10: a[0][3]  11: a[1][3]  12: a[2][3]
stride-N
Locality Example #3

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    …
}
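By analogy with Examples #1 and #2, a plausible completion (an assumed version, not necessarily the slide's own) picks the loop order that sweeps a row-major 3-D array in stride-1 order, with the last index innermost:

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    /* last index varies fastest, so accesses walk memory in stride-1 order */
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[i][j][k];
    return sum;
}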
Cache Performance Metrics
• Hit Time
  – Time to deliver a line in the cache to the processor
  – Includes time to determine whether the line is in the cache
  – Typical hit times: 1-2 clock cycles for L1; 5-20 clock cycles for L2
• Miss Penalty
  – Additional time required because of a miss
  – Typically 50-200 cycles to reach main memory (trend: increasing!)
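These two metrics combine into the average memory access time; a worked example using the standard formula (the numbers below are illustrative, not from the slides):

  average access time = hit time + miss rate × miss penalty

Assuming a 1-cycle hit time and a 100-cycle miss penalty:
  97% hit rate: 1 + 0.03 × 100 = 4 cycles on average
  99% hit rate: 1 + 0.01 × 100 = 2 cycles on average

A two-percentage-point change in hit rate doubles the average access time, which is why the miss rate, not the hit rate, is the number to watch.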
Memory Hierarchies
• Some fundamental and enduring properties of hardware and software systems:
  – Faster storage technologies almost always cost more per byte and have lower capacity
  – The gaps between memory technology speeds are widening
    (true for registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.)
  – Well-written programs tend to exhibit good locality
[Diagram: memory hierarchy — smaller, faster, and costlier per byte toward the top.
  on-chip L1 cache (SRAM): holds cache lines retrieved from the L2 cache
  off-chip L2 cache (SRAM): holds cache lines retrieved from main memory
The program just sees "memory"; the hardware manages the caching transparently.]
Memory Hierarchies
• Fundamental idea of a memory hierarchy:
  – For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
• Why do memory hierarchies work?
  – Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
  – Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
• Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
[Diagram: L1 d-cache and i-cache pairs (one per core, repeated), above a unified L2 cache: 256 KB, 8-way, access 11 cycles; main memory below.]
A puzzle.
• What can you infer from this: [timing figure missing]
Associativity
• What if we could store data in any place in the cache?
• That might slow down caches (more complicated hardware), so we do something in between.
• Each address maps to exactly one set.

  1-way: 8 sets, 1 block each (direct-mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set, 8 blocks (fully associative)

[Diagram: a 4-bit address split into tag bits, set-index bits, and block-offset bits; the field widths depend on the number of sets and the block size.]
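As a concrete sketch (hypothetical parameters, not from the slides), the three fields can be extracted with shifts and masks once s set-index bits and b block-offset bits are fixed:

#include <stdint.h>
#include <stdio.h>

#define S_BITS 2   /* assumed: 4 sets, as in the 2-way configuration above */
#define B_BITS 3   /* assumed: 8-byte blocks */

static uint64_t block_offset(uint64_t addr) { return addr & ((1ULL << B_BITS) - 1); }
static uint64_t set_index(uint64_t addr)    { return (addr >> B_BITS) & ((1ULL << S_BITS) - 1); }
static uint64_t tag_of(uint64_t addr)       { return addr >> (B_BITS + S_BITS); }

int main(void) {
    uint64_t addr = 0x74;   /* example address: 0111 0100 */
    printf("tag=%llu set=%llu offset=%llu\n",
           (unsigned long long)tag_of(addr),        /* 3 */
           (unsigned long long)set_index(addr),     /* 2 */
           (unsigned long long)block_offset(addr)); /* 4 */
    return 0;
}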
Block replacement
• Any empty block in the correct set may be used for storing data.
• If there are no empty blocks, which one should we replace?
  – Obvious for direct-mapped caches (only one candidate); what about set-associative?
  – Caches typically use something close to least recently used (LRU), as sketched below
  – (hardware usually implements "not most recently used")
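A minimal software model of LRU for one E-way set (a sketch with explicit age counters; real hardware approximates this, as noted above):

#define E 4   /* assumed: 4-way set associative */

typedef struct {
    int valid;
    unsigned long tag;
    unsigned age;   /* higher = less recently used */
} Line;

/* Pick the line to replace: an empty line if any, else the oldest. */
int victim(Line set[E]) {
    int v = 0;
    for (int i = 0; i < E; i++) {
        if (!set[i].valid) return i;
        if (set[i].age > set[v].age) v = i;
    }
    return v;
}

/* After a hit on line h, make it the most recently used. */
void touch(Line set[E], int h) {
    for (int i = 0; i < E; i++) set[i].age++;
    set[h].age = 0;
}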
Another puzzle.
• What can you infer from this: [timing figure missing; the surviving label reads "direct-mapped cache"]
General Cache Organization (S, E, B)
[Diagram: the cache is an array of S = 2^s sets; each set holds E lines; each line contains a valid bit, a tag, and a data block of B = 2^b bytes (bytes 0, 1, 2, …, B-1).
Cache size: S × E × B data bytes.]

[Diagram: to locate a block, the address is split into tag | set index | block offset. The set index selects one of the S = 2^s sets; the tag identifies the line within that set; the block offset selects a byte within the B = 2^b-byte data block.]
Example: Direct-Mapped Cache (E = 1)
[Diagram: address of an int, split as t tag bits | set index 0…01 | block offset 100, against a cache whose lines each hold a valid bit, a tag, and 8 data bytes (0-7):
  1. the set-index bits (0…01) find the set;
  2. valid? + tag match?: yes = hit;
  3. the block offset (100, i.e., byte 4) selects the int within the block.]
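Gathering those steps into code (a sketch with assumed sizes, not the slides' implementation):

#include <stdbool.h>
#include <stdint.h>

#define S 4   /* sets (assumed) */
#define B 8   /* bytes per block, as in the diagram */

typedef struct {
    bool valid;
    uint64_t tag;
    uint8_t data[B];
} Line;

static Line cache[S];   /* direct-mapped: E = 1 line per set */

/* Returns true on a hit and stores the addressed byte in *out. */
bool lookup(uint64_t addr, uint8_t *out) {
    uint64_t offset = addr % B;          /* block offset */
    uint64_t set    = (addr / B) % S;    /* set index    */
    uint64_t tag    = addr / (B * S);    /* tag          */
    Line *line = &cache[set];
    if (line->valid && line->tag == tag) {   /* valid? + match? */
        *out = line->data[offset];
        return true;                          /* hit */
    }
    return false;   /* miss: the line would be refilled from memory */
}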
[Diagram: dot product under a direct-mapped cache.
If x and y have aligned starting addresses (e.g., &x[0] = 0, &y[0] = 128), x[i] and y[i] map to the same set, so the lines holding x[0..3]/y[0..3] and x[4..7]/y[4..7] keep evicting each other.
If the starting addresses are unaligned (e.g., &x[0] = 0, &y[0] = 160), the arrays land in different sets, and all four lines — x[0..3], x[4..7], y[0..3], y[4..7] — can be cached at once.]
No match:
• One line in the set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
Example (for E = 2)

float dotprod(float x[8], float y[8])
{
    float sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}

If x and y have aligned starting addresses (e.g., &x[0] = 0, &y[0] = 128), the cache can still fit both arrays, because there are two lines in each set:
[Diagram: x[0..3] and y[0..3] share one set; x[4..7] and y[4..7] share the other.]
[Diagram: write-back caching with a per-line dirty bit. Memory holds T: 0xCAFE and U: 0xBEEF; the cache holds copies of both blocks, one of them modified (values such as 0xFEED and 0xFACE appear, with dirty bits 1 and 0). Memory keeps the stale T: 0xCAFE until the dirty line is written back, after which it holds T: 0xFEED, U: 0xBEEF.]
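A sketch of the policy the diagram shows (a software model; the names and the one-word block are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool valid;
    bool dirty;       /* cached copy differs from memory */
    uint16_t data;    /* one-word block, for illustration */
} WBLine;

/* A write hit updates only the cache and marks the line dirty. */
void write_hit(WBLine *line, uint16_t value) {
    line->data = value;    /* e.g., T becomes 0xFEED in the cache */
    line->dirty = true;    /* memory still holds the old 0xCAFE  */
}

/* On eviction, a dirty line is written back; a clean one is just dropped. */
void evict(WBLine *line, uint16_t *mem_word) {
    if (line->valid && line->dirty)
        *mem_word = line->data;   /* memory now sees 0xFEED */
    line->valid = line->dirty = false;
}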
[Diagram: per-core L1 d-cache and i-cache pairs above a unified L2 cache (256 KB, 8-way, access 11 cycles) — slower than L1, but more likely to hit — with main memory below.]
Cache Miss Analysis
[Diagram: matrix multiply c = a × b; row i of a combines with column j of b.]
• Assume:
  – Matrix elements are doubles
  – Cache block = 64 bytes = 8 doubles
  – Cache size C << n (i.e., much smaller than n, not left-shifted by n)
• First iteration (omitting matrix c):
  – Row of a: chunks of 8 items in a row share a cache line (spatial locality) → n/8 misses
  – Column of b: each item in the column sits in a different cache line → n misses
  – Total: n/8 + n = 9n/8 misses
• Afterwards in cache (schematic): one row of a plus an 8-wide strip of b.
• Total misses:
  – 9n/8 misses per iteration × n^2 iterations = (9/8) n^3
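For reference, a sketch of the triple loop this analysis assumes (the slides' own code is not shown here; row-major n-by-n doubles):

/* c = a * b for n-by-n row-major matrices */
void matmul(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                /* row of a: stride-1; column of b: stride-n */
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}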
Blocked Matrix Multiplication
[Diagram: c = a × b computed block by block — the (i1, j1) block of c combines a row of B × B blocks of a with a column of B × B blocks of b; there are n/B blocks per dimension.]
• Other (block) iterations:
  – Same as the first iteration
  – 2n/B blocks × B^2/8 misses per block = nB/4 misses
Summary
• No blocking: (9/8) n^3 misses
• Blocking: 1/(4B) n^3 misses
• The ratio is (9/8) / (1/(4B)) = 4B × 9/8:
  – If B = 8, the difference is 4 × 8 × 9/8 = 36×
  – If B = 16, the difference is 4 × 16 × 9/8 = 72×
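A sketch of the blocked loop nest (assuming n is a multiple of the block size and c starts zeroed; not necessarily the slides' exact code):

#define BSIZE 8   /* B = 8 doubles = one 64-byte cache line */

/* c += a * b, n-by-n row-major, n a multiple of BSIZE, c pre-zeroed */
void matmul_blocked(int n, const double *a, const double *b, double *c) {
    for (int i1 = 0; i1 < n; i1 += BSIZE)
        for (int j1 = 0; j1 < n; j1 += BSIZE)
            for (int k1 = 0; k1 < n; k1 += BSIZE)
                /* the B x B blocks of a, b, and c fit in cache together */
                for (int i = i1; i < i1 + BSIZE; i++)
                    for (int j = j1; j < j1 + BSIZE; j++) {
                        double sum = c[i*n + j];
                        for (int k = k1; k < k1 + BSIZE; k++)
                            sum += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = sum;
                    }
}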
Cache-Friendly Code
• The programmer can optimize for cache performance:
  – How data structures are organized
  – How data are accessed
  – Nested loop structure
  – Blocking is a general technique
• All systems favor "cache-friendly code"
  – Getting absolute optimum performance is very platform specific
    (cache sizes, line sizes, associativities, etc.)
  – You can get most of the advantage with generic code:
    – Keep the working set reasonably small (temporal locality)
    – Use small strides (spatial locality)
    – Focus on inner loop code
The Memory Mountain
[Figure: read throughput (MB/s) as a function of working-set size (labeled: 2K, 16K, 128K, 1M, 8M) and stride (s1 through s15), measured on an Intel Core i7 with a 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, and 8 MB unified L3 cache (all caches on-chip). The throughput axis runs from 1000 to 7000 MB/s; the labeled plateaus correspond to working sets served by L1, L2, L3, and main memory.]