Lecture05 Memory Hierarchy Cache
Example Memory Hierarchy

[Figure: memory hierarchy pyramid, levels L0 through L6. Smaller, faster, and costlier (per byte) storage devices sit toward the top; larger, slower, and cheaper (per byte) storage devices sit toward the bottom.]

L0: Regs
L1: L1 cache (SRAM)
L2: L2 cache (SRAM)
L3: L3 cache (SRAM)
L4: Main memory (DRAM)

The L3 cache holds cache lines retrieved from main memory.
[Figure: general cache concept. The cache holds copies of a few numbered blocks drawn from the much larger memory, which is shown as an array of numbered blocks (..., 10, 11, 12, 13, 14, 15).]
Cache Memories

[Figure: CPU chip containing the register file, the ALU, the cache memory, and a bus interface; the bus interface connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.]
General cache organization: S = 2^s sets, E = 2^e lines per set, B = 2^b bytes per cache block (the data). Each line holds a valid bit, a tag, and a block of B data bytes at offsets 0, 1, 2, ..., B-1.

Cache size: C = S x E x B data bytes

Cache Read
Locate the set.
Check if any line in the set has a matching tag.
Yes + line valid: hit.
Locate the data starting at the block offset.

Address of word: t tag bits | s set-index bits | b block-offset bits.
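As a concrete illustration of this address split, here is a minimal C sketch; the geometry (6 set-index bits, 6 block-offset bits) and the example address are illustrative choices, not values from the slides.

#include <stdio.h>
#include <stdint.h>

#define S_BITS 6   /* s: set-index bits   (2^6 = 64 sets)          */
#define B_BITS 6   /* b: block-offset bits (2^6 = 64-byte blocks)  */

int main(void) {
    uint64_t addr = 0x7ffe12345678ULL;                    /* example address */
    uint64_t block_offset = addr & ((1ULL << B_BITS) - 1);
    uint64_t set_index = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
    uint64_t tag = addr >> (B_BITS + S_BITS);             /* the remaining t bits */
    printf("tag=0x%llx set=%llu offset=%llu\n",
           (unsigned long long)tag,
           (unsigned long long)set_index,
           (unsigned long long)block_offset);
    return 0;
}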
Example: direct-mapped cache (E = 1 line per set), B = 8 bytes per block.
Address of int: t tag bits | set index 001 | block offset 100.
Find the set: the set-index bits (001) select one set; each set holds a single line (tag | bytes 0 1 2 3 4 5 6 7).
Check the line: if it is valid and its tag matches the address tag, the access is a hit.
Block offset: the offset bits (100) locate the data within the block; the requested int (4 Bytes) is here, at bytes 4-7.
Direct-mapped cache simulation: 4-bit addresses split as t=1 tag bit, s=2 set-index bits (xx), and b=1 block-offset bit (x); the cache has Set 0 through Set 3, one line per set, and 2-byte blocks. Each line records a valid bit v, a tag, and a block such as M[0-1], M[6-7], or M[8-9]. Because M[0-1] and M[8-9] map to the same set with different tags, references to them keep evicting each other.
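A tiny C version of this simulation, assuming the same geometry (4-bit addresses, t=1, s=2, b=1, one line per set); the particular reference trace below is an illustrative choice.

#include <stdio.h>

#define SETS 4                         /* S = 2^2 sets             */
#define BLOCK 2                        /* B = 2^1 bytes per block  */

struct line { int valid; int tag; };   /* E = 1 line per set       */

int main(void) {
    struct line cache[SETS] = {{0}};
    int trace[] = {0, 1, 7, 8, 0};     /* illustrative 4-bit address trace */
    int n = sizeof(trace) / sizeof(trace[0]);

    for (int i = 0; i < n; i++) {
        int addr = trace[i];
        int block = addr / BLOCK;          /* which memory block   */
        int set = block % SETS;            /* s = 2 set-index bits */
        int tag = block / SETS;            /* t = 1 tag bit        */
        if (cache[set].valid && cache[set].tag == tag) {
            printf("addr %2d: hit  (set %d)\n", addr, set);
        } else {
            printf("addr %2d: miss (set %d, load M[%d-%d])\n",
                   addr, set, block * BLOCK, block * BLOCK + 1);
            cache[set].valid = 1;          /* evict and replace the line */
            cache[set].tag = tag;
        }
    }
    return 0;
}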
Example: 2-way set-associative cache (E = 2 lines per set), B = 8 bytes per block; each set holds two lines (tag | bytes 0 1 2 3 4 5 6 7).
Address of short int: t tag bits | set index 001 | block offset 100.
Find the set: the set-index bits (001) select one set.
Compare both: the address tag is compared against the tags of both lines in the set; a valid line with a matching tag is a hit.
Block offset: the offset bits (100) locate the data; the requested short int (2 Bytes) is here, at bytes 4-5 of the matching block.
No match: one line in the set is selected for eviction and replacement.
Replacement policies include random and least recently used (LRU).
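A minimal C sketch of a 2-way set lookup with LRU replacement; the structure and function names are illustrative, not from the slides.

#include <stdint.h>

#define WAYS 2                          /* E = 2 lines per set          */

struct line {
    int valid;
    uint64_t tag;
    uint64_t last_used;                 /* timestamp used for LRU       */
};

/* Access one set with the given tag; returns 1 on a hit, 0 on a miss
 * (in which case a victim line is replaced). */
static int access_set(struct line set[WAYS], uint64_t tag, uint64_t now) {
    for (int w = 0; w < WAYS; w++) {            /* compare both tags     */
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = now;             /* hit: refresh LRU info */
            return 1;
        }
    }
    /* Miss: prefer an invalid line, otherwise evict the least recently used. */
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) { victim = w; break; }
        if (set[w].last_used < set[victim].last_used) victim = w;
    }
    set[victim].valid = 1;
    set[victim].tag = tag;
    set[victim].last_used = now;
    return 0;
}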
2-way set-associative cache simulation: 4-bit addresses split as t=2 tag bits, s=1 set-index bit (x), and b=1 block-offset bit (x); the cache has Set 0 and Set 1, two lines per set, and 2-byte blocks. With two lines per set, blocks that conflicted in the direct-mapped cache can now coexist: Set 0 holds M[0-1] (tag 00) together with M[8-9] (tag 10), and Set 1 holds M[6-7] (tag 01).
What to do on a write-hit?
Write-through (write immediately to memory)
Write-back (defer write to memory until replacement of line)
What to do on a write-miss?
Write-allocate (load into cache, update line in cache)
Good if more writes to the location follow
No-write-allocate (writes straight to memory, does not load into cache)
Typical combinations:
Write-through + No-write-allocate
Write-back + Write-allocate
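A sketch in C of the common write-back + write-allocate pairing, for a single cache line that has already been located; next_level and the two helper functions are illustrative stand-ins for the next level of the hierarchy (assuming tags index at most 16 blocks), not a real API.

#include <stdint.h>
#include <string.h>

#define BLOCK 64

/* Stand-in for the next level of the hierarchy (16 blocks, indexed by tag). */
static uint8_t next_level[16][BLOCK];

static void mem_read_block(uint64_t tag, uint8_t *dst)        { memcpy(dst, next_level[tag], BLOCK); }
static void mem_write_block(uint64_t tag, const uint8_t *src) { memcpy(next_level[tag], src, BLOCK); }

struct wline {
    int valid;
    int dirty;                     /* write-back: line is newer than memory */
    uint64_t tag;
    uint8_t data[BLOCK];
};

/* Write-back + write-allocate for a single, already-located cache line. */
void write_byte(struct wline *l, uint64_t tag, int offset, uint8_t value) {
    if (!(l->valid && l->tag == tag)) {       /* write miss                      */
        if (l->valid && l->dirty)             /* write-back: flush the old block */
            mem_write_block(l->tag, l->data);
        mem_read_block(tag, l->data);         /* write-allocate: load the block  */
        l->valid = 1;
        l->tag = tag;
        l->dirty = 0;
    }
    l->data[offset] = value;                  /* update the line in the cache    */
    l->dirty = 1;                             /* defer the write to memory       */
}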
Example multi-core cache hierarchy

[Figure: each core (e.g., Core 3) has its own registers, L1 i-cache, L1 d-cache, and L2 unified cache; a single L3 unified cache is shared by all cores and sits above main memory.]

L2 unified cache: 256 KB, 8-way, access: 10 cycles
L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
Block size: 64 bytes for all caches.
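To see numbers like these on a real machine, a short C program can query the cache geometry at run time. The sysconf parameters below are glibc extensions on Linux, so this is a platform-specific sketch; unsupported values come back as 0 or -1.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc-specific sysconf names; sizes are reported in bytes */
    printf("L1 d-cache size : %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 line size    : %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 cache size   : %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache size   : %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("L3 associativity: %ld\n", sysconf(_SC_LEVEL3_CACHE_ASSOC));
    return 0;
}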
Miss Rate
Fraction of memory references not found in cache (misses / accesses)
= 1 - hit rate
Typical numbers (in percentages):
3-10% for L1
can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
Time to deliver a line in the cache to the processor
includes time to determine whether the line is in the cache
Typical numbers:
4 clock cycles for L1
10 clock cycles for L2
Miss Penalty
Additional time required because of a miss
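These metrics combine into the average memory access time: average time = hit time + miss rate x miss penalty. As an illustration (the miss rate and penalty here are assumed figures, not slide values): with a 4-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, the average access costs 4 + 0.05 x 100 = 9 cycles, which is why even small miss rates matter.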
Rearranging Loops to Improve Spatial Locality
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
Description:
Multiply N x N matrices
Matrix elements are doubles (8 bytes)
O(N^3) total operations
N reads per source element
N values summed per destination
but may be able to hold in register
Assume:
Block size = 32B (big enough for four doubles)
Matrix dimension (N) is very large
Approximate 1/N as 0.0
Cache is not even big enough to hold multiple rows
Analysis Method:
Look at access pattern of inner loop
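Under these assumptions, a row-wise (stride-1) scan of doubles touches a new 32-byte block only once every four elements, so it misses on about 0.25 of its accesses, while a column-wise scan with a very large stride misses on essentially every access (miss rate 1.0); the comparison of loop orderings below follows from this.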
Inner-loop access patterns for each class of loop ordering:

ijk / jik (inner loop over k): A(i,*) scanned row-wise, B(*,j) scanned column-wise, C(i,j) fixed.
kij / ikj (inner loop over j): A(i,k) fixed, B(k,*) scanned row-wise, C(i,*) scanned row-wise.
jki / kji (inner loop over i): A(*,k) scanned column-wise, B(k,j) fixed, C(*,j) scanned column-wise.
The two orderings within each class (ijk / jik, kij / ikj, jki / kji) share the same inner loop and therefore the same cache behavior. With row-major storage, kij / ikj has the best spatial locality, since every non-fixed matrix is scanned row-wise.
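As a sketch of rearranging the loops accordingly, here is the kij ordering of the same multiplication (same a, b, c, n as the ijk version; r is a local double, and c is assumed to be zero-initialized before the loops):

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];                  /* fixed across the inner loop   */
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];     /* b and c both scanned row-wise */
  }
}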
Cache Summary