Caches
Contents
1. Memory hierarchy
   1. Basic concepts
   2. Design techniques
2. Caches
   1. Types of caches: fully associative, direct mapped, set associative
   2. Ten optimization techniques
3. Main memory
   1. Memory technology
   2. Memory optimization
   3. Power consumption
4. Memory hierarchy case studies: Opteron, Pentium, i7
5. Virtual memory
6. Problem solving
Introduction
Programmers want very large memory with low latency, but fast memory technology is more expensive per bit than slower memory.
Solution: organize the memory system into a hierarchy:
The entire addressable memory space is available in the largest, slowest memory.
Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories.
This gives the processor the illusion of a large, fast memory.
[Figure: memory hierarchy; latency increases from the processor through the L1, L2, and L3 caches to main memory]
Example: my PowerBook
Intel Core i7, 2 cores at 2.8 GHz
L2 cache: 256 KB per core
L3 cache: 4 MB
Main memory: 16 GB (two 8 GB DDR3 modules at 1.6 GHz)
Disk: 500 GB
Fully associative cache
[Figure: main memory with blocks 0-31; cache with 8 block frames]
A block can be placed in any location in the cache.
Direct mapped cache
[Figure: main memory with blocks 0-31; cache with 8 block frames]
Mapping: (Block address) MOD (Number of blocks in cache)
Example: 12 MOD 8 = 4
A block can be placed in ONLY a single location in the cache.
Set associative cache
[Figure: main memory with blocks 0-31; 2-way cache with 8 block frames organized as 4 sets]
Mapping: (Block address) MOD (Number of sets in cache)
Example: 12 MOD 4 = 0
A block can be placed in one of n locations in an n-way set associative cache.
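The two mapping rules above can be sketched as small index functions (a minimal sketch; the function names are illustrative, not from the slides):

```c
#include <assert.h>

/* Direct mapped: each block has exactly one candidate frame. */
unsigned direct_mapped_index(unsigned block_addr, unsigned num_blocks) {
    return block_addr % num_blocks;   /* (Block address) MOD (Number of blocks) */
}

/* n-way set associative: the block may go in any of the n ways of one set. */
unsigned set_index(unsigned block_addr, unsigned num_sets) {
    return block_addr % num_sets;     /* (Block address) MOD (Number of sets) */
}
```

With 8 cache blocks, block 12 maps to frame 4; with 4 sets, block 12 maps to set 0, matching the figures. A fully associative cache is the degenerate case of one set.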
Memory hierarchy basics
Dirty bit
Two types of caches:
Instruction cache: I-cache
Data cache: D-cache
The dirty bit indicates whether the cache block has been written to (modified).
No dirty bit is needed for I-caches, since instructions are only read, never modified.
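The dirty bit's role in a write-back D-cache can be sketched as cache-line metadata (a hypothetical, simplified layout; field and function names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-line metadata for a write-back D-cache. */
struct cache_line {
    uint32_t tag;
    bool valid;
    bool dirty;   /* set when the block is modified; absent in an I-cache */
};

/* On a store hit, just mark the line modified; memory is not updated yet. */
void write_hit(struct cache_line *line) {
    line->dirty = true;
}

/* On eviction, only dirty lines must be written back to main memory. */
bool needs_writeback(const struct cache_line *line) {
    return line->valid && line->dirty;
}
```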
Write back
[Figure: data path from the CPU through the D-cache to main memory; with write back, the cache absorbs writes and main memory is updated only on eviction]
Cache organization
[Figure: cache organization; the tag comparator (=) and a multiplexer (MUX) select the requested data word]
Causes of misses
Compulsory: the first reference to a block.
Capacity: the cache cannot contain all the blocks the program needs.
Conflict: addresses from different blocks map to the same location in the cache.
Metrics for the optimization techniques:
Reducing the hit time
Increasing cache bandwidth
Reducing the miss penalty
Reducing the miss rate
Reducing miss penalty or miss rate via parallelism
Pipelined cache access: the L1 hit time has grown over processor generations:
Pentium: 1 cycle
Pentium Pro – Pentium III: 2 cycles
Pentium 4 – Core i7: 4 cycles
Increases the branch misprediction penalty
Makes it easier to increase associativity
Advanced Optimizations
Nonblocking caches
Allow hits to proceed before earlier misses complete, much like pipelining the memory system:
"Hit under miss"
"Hit under multiple miss"
Important for hiding memory latency.
L2 must support this.
In general, processors can hide the L1 miss penalty but not the L2 miss penalty.
Merging write buffer
When storing to a block that is already pending in the write buffer, update the existing write-buffer entry instead of allocating a new one.
Reduces stalls due to a full write buffer.
Does not apply to I/O addresses.
[Figure: write-buffer contents without and with write merging]
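The merging rule can be sketched as follows (a simplified model, not real hardware; the sizes and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES 4
#define BLOCK_WORDS 8   /* hypothetical 8-word write-buffer blocks */

struct wb_entry {
    bool     valid;
    uint32_t block_addr;            /* block number of the buffered block */
    uint32_t data[BLOCK_WORDS];
    uint8_t  word_valid;            /* bitmask of words written so far */
};

struct wb_entry wb[WB_ENTRIES];     /* zero-initialized: all entries free */

/* Returns true if the store was absorbed (merged or newly buffered),
   false if the buffer is full and the CPU would stall. */
bool buffer_store(uint32_t word_addr, uint32_t value) {
    uint32_t block  = word_addr / BLOCK_WORDS;
    unsigned offset = word_addr % BLOCK_WORDS;
    /* Merge: the block is already pending, so update the entry in place. */
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].data[offset] = value;
            wb[i].word_valid |= 1u << offset;
            return true;
        }
    /* Otherwise allocate a free entry. */
    for (int i = 0; i < WB_ENTRIES; i++)
        if (!wb[i].valid) {
            memset(&wb[i], 0, sizeof wb[i]);
            wb[i].valid = true;
            wb[i].block_addr = block;
            wb[i].data[offset] = value;
            wb[i].word_valid = 1u << offset;
            return true;
        }
    return false;  /* full: stall until an entry drains to memory */
}
```

Without merging, each store would claim its own entry and the four-entry buffer would fill four times faster for sequential stores.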
Compiler optimizations
Loop interchange: swap nested loops to access memory in sequential order.
Blocking: instead of accessing entire rows or columns, subdivide matrices into blocks; requires more memory accesses but improves the locality of the accesses.
Loop interchange example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];
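A runnable scaled-down version of the interchange, showing that both loop orders compute the same result while only the interchanged one walks row-major memory with stride 1 (array sizes here are illustrative stand-ins for the 5000 x 100 array above):

```c
#include <assert.h>

#define ROWS 50
#define COLS 20

/* "Before": the inner loop strides down a column, jumping COLS
   elements through row-major memory on every iteration. */
void scale_column_order(int x[ROWS][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* "After": loops interchanged so the inner loop walks one row
   sequentially -- stride-1 accesses, identical result. */
void scale_row_order(int x[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```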
Loop fusion example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }
Motivation for blocking: unblocked matrix multiply reads one row of y[] repeatedly and writes N elements of one row of x[], so capacity misses are a function of N and the cache size.
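A runnable sketch of the fusion example (with a small illustrative N and double-precision arrays, so 1/b[i][j] is a real division): the fused loop reuses a[i][j] and c[i][j] while they are still hot in the cache, and both versions produce identical results.

```c
#include <assert.h>

#define N 16

/* Fused: both statements touch a[i][j] and c[i][j] in the same
   iteration, so the second use hits in the cache. */
void fused(double a[N][N], double b[N][N], double c[N][N], double d[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}

/* Unfused reference, as in the "before" code. */
void unfused(double a[N][N], double b[N][N], double c[N][N], double d[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1 / b[i][j] * c[i][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];
}
```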
Blocking example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1)
      { r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }
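The blocked loop nest above can be checked against an untiled triple loop; a minimal runnable sketch, assuming x starts zeroed and using small illustrative values of N and B:

```c
#include <assert.h>

#define N 8
#define B 4

static int min(int a, int b) { return a < b ? a : b; }

/* Blocked x += y*z, as in the slide's "after" code. */
void blocked_matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}

/* Reference: untiled triple loop. */
void naive_matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                x[i][j] += y[i][k] * z[k][j];
}
```

The blocked version touches only B x B tiles of y and z at a time, so the working set fits in the cache independently of N.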
[Figure: snapshot of arrays x, y, z when N = 6 and i = 1]
Reducing conflict misses by blocking
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache]
Conflict misses in caches that are not fully associative, vs. blocking size.
Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of a factor of 48, despite both fitting in the cache.
Summary of compiler optimizations to reduce cache misses (by hand)
[Figure: performance improvement (1x to 3x) for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress]
9) Hardware prefetching
Pentium 4 pre-fetching.
Register prefetch: loads data into a register.
Cache prefetch: loads data only into the cache.
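Compilers also expose cache prefetching directly. A minimal sketch using GCC/Clang's `__builtin_prefetch` (a cache prefetch: the data lands in the cache, not in a register; the distance of 16 elements is a hypothetical tuning choice, not from the slides):

```c
/* Requires GCC or Clang for __builtin_prefetch. */
long sum_with_prefetch(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        /* Prefetch 16 elements ahead, for reading, with moderate
           temporal locality; a no-op past the end of the array. */
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}
```

Prefetching only changes timing, never the result, so the function returns the same sum as a plain loop.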
Summary