Lecture Slides 07 076-Caches-Opt

University of Washington

Section 7: Memory and Caches


- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
- Program optimizations that consider caches

Caches and Program Optimizations



Optimizations for the Memory Hierarchy


- Write code that has locality
  - Spatial: access data contiguously
  - Temporal: make sure accesses to the same data are not too far apart in time
- How to achieve this?
  - Proper choice of algorithm
  - Loop transformations


Example: Matrix Multiplication


c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b, accumulating into c */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

[Figure: c[i][j] (row i, column j) is computed from row i of a times column j of b]
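A minimal driver for the routine above might look like this. It is a sketch, not from the slides: the 2 x 2 matrices and the identity-matrix check are purely illustrative.

```c
#include <stdlib.h>

/* Same routine as on the slide */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Multiply a by the 2 x 2 identity: the result equals a,
   so c[1] should hold a[0][1] == 2. */
double mmm_demo(void) {
    int n = 2;
    double a[] = {1, 2, 3, 4};
    double b[] = {1, 0, 0, 1};                 /* identity matrix */
    double *c = calloc(n * n, sizeof(double)); /* zero-initialized */
    mmm(a, b, c, n);
    double r = c[1];
    free(c);
    return r;
}
```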


Cache Miss Analysis


- Assume:
  - Matrix elements are doubles
  - Cache block = 64 bytes = 8 doubles
  - Cache size C << n (much smaller than n)

- First iteration:
  - n/8 + n = 9n/8 misses (omitting matrix c): n/8 misses for the row of a (8 doubles per 64-byte block) plus n misses for the column of b (stride-n access, one miss per element)
- Afterwards in cache (schematic):
[Figure: one 8-wide row sliver of a and one column of b held in cache]

Cache Miss Analysis


- Assume:
  - Matrix elements are doubles
  - Cache block = 64 bytes = 8 doubles
  - Cache size C << n (much smaller than n)

- Other iterations:
  - Again n/8 + n = 9n/8 misses (omitting matrix c)

- Total misses:
  - 9n/8 * n² = (9/8) * n³
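The total above can be spot-checked with a tiny helper (a sketch, not from the slides; it assumes n is a multiple of 8 so the integer division is exact):

```c
/* Estimated misses for unblocked mmm: 9n/8 misses per (i,j) pair,
   n*n pairs in total, i.e. (9/8) * n^3 (omitting matrix c). */
long unblocked_misses(long n) {
    return 9 * n / 8 * n * n;
}
```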


Blocked Matrix Multiplication


c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b, one B x B block at a time
   (B is the block size, assumed to divide n) */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

[Figure: the B x B block of c containing (i1, j1) is computed from a block row of a and a block column of b; block size B x B]
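The analysis on the next slides requires three B x B blocks to fit in the cache (3B² < C). A helper like the following (a hypothetical sketch, not from the slides) picks the largest such B for a cache holding a given number of doubles:

```c
/* Largest B with 3*B*B < cap, where cap is the cache capacity in doubles
   (e.g. a 32 KB cache holds 4096 doubles). */
int pick_block_size(int cap) {
    int B = 1;
    while (3 * (B + 1) * (B + 1) < cap)
        B++;
    return B;
}
```

For a 32 KB data cache (4096 doubles) this yields B = 36; in practice B is often rounded down to a multiple of 8 so blocks align with the 8-double cache lines assumed here.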

Cache Miss Analysis


- Assume:
  - Cache block = 64 bytes = 8 doubles
  - Cache size C << n (much smaller than n)
  - Three blocks fit into cache: 3B² < C

- First (block) iteration:
  - B²/8 misses for each block (B² doubles, 8 per cache block)
  - 2n/B * B²/8 = nB/4 misses (omitting matrix c): each B x B block of c touches a block row of a and a block column of b, 2n/B blocks in total
- Afterwards in cache (schematic):
[Figure: a block row of a and a block column of b held in cache; each matrix is n/B blocks wide, block size B x B]

Cache Miss Analysis


- Assume:
  - Cache block = 64 bytes = 8 doubles
  - Cache size C << n (much smaller than n)
  - Three blocks fit into cache: 3B² < C

- Other (block) iterations:
  - Same as the first iteration: 2n/B * B²/8 = nB/4

- Total misses:
  - nB/4 * (n/B)² = n³/(4B)
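Plugging concrete numbers into the blocked total n³/(4B) (a sketch, not from the slides; it assumes B divides n so the division is exact):

```c
/* Estimated misses for blocked mmm: n^3 / (4B) (omitting matrix c) */
long blocked_misses(long n, long B) {
    return n * n * n / (4 * B);
}
```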


Summary
- No blocking: (9/8) * n³
- Blocking: 1/(4B) * n³
  - If B = 8, the difference is 4 * 8 * 9/8 = 36x
  - If B = 16, the difference is 4 * 16 * 9/8 = 72x

- This suggests using the largest possible block size B, subject to the limit 3B² < C!
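The 36x and 72x figures come from dividing the two miss totals: ((9/8) n³) / (n³/(4B)) = (9/8) * 4B. A quick arithmetic check (a sketch, not from the slides):

```c
/* Miss-count ratio of unblocked to blocked mmm: (9/8) * 4B = 4.5 * B */
double blocking_speedup(int B) {
    return 9.0 / 8.0 * 4.0 * B;
}
```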

- Reason for the dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: 3n², computation: 2n³
    - Every array element is used O(n) times!
  - But the program has to be written properly


Cache-Friendly Code
- Programmers can optimize for cache performance:
  - How data structures are organized
  - How data are accessed
    - Nested loop structure
    - Blocking is a general technique
- All systems favor "cache-friendly code"
  - Getting absolute optimum performance is very platform-specific
    - Cache sizes, line sizes, associativities, etc.
  - You can get most of the advantage with generic code:
    - Keep the working set reasonably small (temporal locality)
    - Use small strides (spatial locality)
    - Focus on the inner-loop code
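As a concrete illustration of the stride advice, the two loops below sum the same row-major matrix; only the traversal order differs (a hypothetical example, not from the slides). The row-wise version walks memory contiguously (stride 1), while the column-wise version jumps n doubles between accesses, so for large n it can miss on nearly every access:

```c
/* Row-wise traversal: stride-1, cache-friendly for C's row-major arrays */
double sum_rowwise(const double *m, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += m[i*n + j];
    return s;
}

/* Column-wise traversal: stride-n; same result, far worse spatial locality */
double sum_colwise(const double *m, int n) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += m[i*n + j];
    return s;
}
```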


The Memory Mountain
- Measured on an Intel Core i7: 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, 8 MB unified L3 cache; all caches on-chip
[Figure: read throughput (MB/s, 0 to 7000) as a function of stride (x8 bytes, s1 to s32) and working set size (4 KB to 64 MB); throughput plateaus correspond to the L1, L2, and L3 caches and main memory]
