Lecture Slides 07 076-Caches-Opt
[Figure: c = a * b, n x n matrices; i indexes rows of a, j indexes columns of b]
First iteration:
n/8 + n = 9n/8 misses (omitting matrix c)
Afterwards in cache (schematic): one row of a, plus 8-wide pieces of each column of b
Caches and Program Optimizations
University of Washington
Other iterations:
Again, n/8 + n = 9n/8 misses (omitting matrix c)
Total misses:
9n/8 * n^2 = (9/8) * n^3
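The counts above come from the standard triple-nested loop: the inner loop walks a row of a with stride 1 (about n/8 misses, assuming 8 doubles per cache line) and a column of b with stride n (about n misses). A minimal sketch, assuming row-major double arrays (the function name is mine, not from the slides):

```c
#include <stddef.h>

/* Naive matrix multiply: c = a * b, all n x n, row-major.
   Per (i, j): a[i*n + k] is stride-1 (~n/8 misses),
   b[k*n + j] is stride-n (~n misses). */
void matmul(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}
```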
[Figure: blocked c = a * b; i1 indexes block rows of a, j1 indexes block columns of b]
Block size B x B
n/B blocks per row and column
First (block) iteration:
B^2/8 misses for each block
2n/B * B^2/8 = nB/4 (omitting matrix c)
Afterwards in cache (schematic): one block row of a, one block column of b
Other (block) iterations:
Same as first iteration: 2n/B * B^2/8 = nB/4
Total misses:
nB/4 * (n/B)^2 = n^3/(4B)
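The blocked version keeps a pair of B x B blocks resident in cache while they are reused, so each block costs only ~B^2/8 misses. A sketch under the same row-major assumption (the helper name and MIN macro are mine; c must be zero-initialized by the caller, since the block loops accumulate into it):

```c
#include <stddef.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Blocked matrix multiply: c += a * b with B x B blocks.
   Each block pair is multiplied while resident in cache,
   so misses drop to ~B*B/8 per block touched. */
void matmul_blocked(double *c, const double *a, const double *b,
                    size_t n, size_t B) {
    for (size_t i1 = 0; i1 < n; i1 += B)
        for (size_t j1 = 0; j1 < n; j1 += B)
            for (size_t k1 = 0; k1 < n; k1 += B)
                /* multiply one pair of B x B blocks */
                for (size_t i = i1; i < MIN(i1 + B, n); i++)
                    for (size_t j = j1; j < MIN(j1 + B, n); j++) {
                        double sum = c[i*n + j];
                        for (size_t k = k1; k < MIN(k1 + B, n); k++)
                            sum += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = sum;
                    }
}
```

For the analysis to hold, B is chosen so three B x B blocks fit in cache at once.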
Summary
No blocking: (9/8) * n^3 misses
Blocking: 1/(4B) * n^3 misses
Ratio: ((9/8) * n^3) / (n^3/(4B)) = (9/2) * B
If B = 8, the difference is 4 * 8 * 9 / 8 = 36x
If B = 16, the difference is 4 * 16 * 9 / 8 = 72x
Cache-Friendly Code
Programmer can optimize for cache performance
How data structures are organized
How data are accessed
Nested loop structure
Blocking is a general technique
All systems favor “cache-friendly code”
Getting absolute optimum performance is very platform specific
Cache sizes, line sizes, associativities, etc.
Can get most of the advantage with generic code
Keep working set reasonably small (temporal locality)
Use small strides (spatial locality)
Focus on inner loop code
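The "small strides" point can be seen in something as simple as summing a 2-D array: traversing row-major data row by row gives stride-1 accesses, while traversing column by column gives stride-n accesses that miss on nearly every element for large n. A sketch (function names are mine, not from the slides):

```c
#include <stddef.h>

/* Two ways to sum an n x n row-major array; both return the same
   value, but their cache behavior differs sharply for large n. */

/* stride-1: good spatial locality, ~1 miss per 8 doubles */
double sum_rows(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i*n + j];
    return s;
}

/* stride-n: ~1 miss per element once rows exceed the cache */
double sum_cols(const double *a, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i*n + j];
    return s;
}
```

Swapping the loop order is often all "generic" cache-friendly code requires.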
The Memory Mountain
Intel Core i7:
32 KB L1 i-cache, 32 KB L1 d-cache
256 KB unified L2 cache
8 MB unified L3 cache
All caches on-chip
[Figure: read throughput (MB/s, 0-7000) vs. working-set size (4K-64M) and stride (s1-s15); ridges correspond to L1, L2, L3, and main memory]
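The mountain is generated by timing a read loop over a working set of a given size at a given stride: small sizes fit in L1 and hit peak throughput, larger sizes fall off the L2, L3, and memory ridges, and larger strides waste more of each cache line. A rough sketch of the measurement, not the actual benchmark code (function names and the timing method are my assumptions; a real harness would use a higher-resolution timer and repeated runs):

```c
#include <stddef.h>
#include <time.h>

/* Stream through `elems` longs at the given stride; returning the
   sum keeps the compiler from eliminating the loop. */
long run_reads(const long *data, size_t elems, size_t stride) {
    long acc = 0;
    for (size_t i = 0; i < elems; i += stride)
        acc += data[i];
    return acc;
}

/* Rough MB/s estimate for one (size, stride) point on the mountain:
   one warm-up pass, then one timed pass over bytes/stride bytes. */
double read_throughput_mbps(const long *data, size_t bytes, size_t stride) {
    size_t elems = bytes / sizeof(long);
    run_reads(data, elems, stride);              /* warm the cache */
    clock_t t0 = clock();
    run_reads(data, elems, stride);              /* timed run */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return secs > 0 ? ((double)bytes / stride) / (secs * 1e6) : 0.0;
}
```

Sweeping `bytes` from 4 KB to 64 MB and `stride` from 1 to 15 traces out the surface in the figure.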