Class 11: Cache
Topics
- Generic cache-memory organization
- Direct-mapped caches
- Set-associative caches
- Impact of caches on performance
Old values are removed from the cache to make space for new values.
- Spatial locality: if a value is used, nearby values are likely to be used.
- Temporal locality: if a value is used, it is likely to be used again soon.
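A minimal sketch (not from the slides): a single loop that exhibits both kinds of locality.

int sum_array(int *a, int n)
{
    int i, sum = 0;      /* sum: temporal locality, reused on every iteration */

    for (i = 0; i < n; i++)
        sum += a[i];     /* a[i]: spatial locality, successive addresses */
    return sum;
}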
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
CPU looks first for data in L1, then in L2, then in main memory. Typical bus structure:
[Figure: typical bus structure. On the CPU chip, the register file and ALU connect to the L1 cache and bus interface; a cache bus leads to the L2 cache, and the system bus, I/O bridge, and memory bus lead to main memory.]
- The tiny, very fast CPU register file has room for four 4-byte words.
- The small, fast L1 cache has room for two 4-word blocks; it is an associative memory.
- The big, slow main memory has room for many 4-word blocks.
- The transfer unit between the cache and main memory is a 4-word block (16 bytes).
[Figure: example contents, with main-memory block 10 holding "abcd", block 21 holding "pqrs", and block 30 holding "wxyz".]
[Figure: generic cache organization. The cache is an array of sets; each set holds E lines, and each line holds a valid bit, a tag, and a B-byte block (bytes 0 through B-1).]
Addressing Caches
An m-bit address A is divided (from bit m-1 down to bit 0) into t tag bits, s set-index bits, and b block-offset bits.
The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
The word contents begin at offset <block offset> bytes from the beginning of the block.
[Figure: cache addressing. The set-index bits of A select a set; the tag bits are compared against the tags of the valid lines in that set.]
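A minimal sketch (not from the slides) of how these fields could be extracted from an address, assuming s set-index bits and b block-offset bits; the struct and function names are illustrative only.

#include <stdint.h>

typedef struct {
    uint64_t tag;     /* t = m - s - b high-order bits   */
    uint64_t set;     /* s middle bits: selects the set  */
    uint64_t offset;  /* b low-order bits: byte in block */
} addr_fields;

addr_fields split_address(uint64_t a, int s, int b)
{
    addr_fields f;
    f.offset = a & ((1ULL << b) - 1);         /* low b bits          */
    f.set    = (a >> b) & ((1ULL << s) - 1);  /* next s bits         */
    f.tag    = a >> (s + b);                  /* remaining high bits */
    return f;
}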
Direct-Mapped Cache
The simplest kind of cache: characterized by exactly one line per set.
[Figure: direct-mapped cache. Each of the sets 0 through S-1 holds a single line with a valid bit, a tag, and a cache block. The s set-index bits of the address select exactly one set.]
Line matching: find a valid line in the selected set with a matching tag.
Word selection: then extract the word.
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: line matching and word selection in a direct-mapped cache. The selected line holds tag 0110 and words w0 through w3 at byte offsets 0 through 7.]
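A minimal sketch (not from the slides) of this check for a direct-mapped cache; the type and function names are assumptions for illustration.

#include <stdint.h>

typedef struct {
    int      valid;   /* condition (1) */
    uint64_t tag;     /* compared for condition (2); the B-byte data block is omitted */
} dm_line;

/* One line per set, so a single comparison decides hit or miss. */
int dm_hit(dm_line *lines, uint64_t set, uint64_t tag)
{
    dm_line *line = &lines[set];
    return line->valid && line->tag == tag;
}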
[Figure: direct-mapped cache simulation. A short trace of references (1) through (5) fills the cache, producing hits, misses, and a replacement; for example, one set ends up holding block M[12-13] with its valid bit set.]
High-order bit indexing (choosing the set from the top address bits) would map adjacent memory blocks to the same set, so caches index with the middle bits instead.
[Figure: high-order bit indexing. Address prefixes 0000x through 1111x and the sets they would map to.]
Set-Associative Caches
Characterized by more than one line per set
[Figure: two-way set-associative cache. Each of the sets 0 through S-1 holds E = 2 lines, and each line holds a valid bit, a tag, and a cache block.]
[Figure: set selection in a set-associative cache. The set-index bits of the address pick out the selected set, exactly as in a direct-mapped cache.]
Must compare the tag in each valid line in the selected set
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines in the set must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: line matching and word selection in a set-associative cache. The selected set holds two lines with tags 1001 and 0110; the matching line's block holds words w0 through w3 at byte offsets 0 through 7, and the block-offset bits (100) select the starting byte.]
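A minimal sketch (not from the slides) of the same check for a set with E lines; the names and types are assumptions for illustration.

#include <stdint.h>

typedef struct {
    int      valid;
    uint64_t tag;
} sa_line;

/* Scan every line in the selected set; hit if any valid line's tag matches.
   Returns the index of the matching line, or -1 on a miss. */
int sa_find(sa_line *set_lines, int E, uint64_t tag)
{
    int i;
    for (i = 0; i < E; i++)
        if (set_lines[i].valid && set_lines[i].tag == tag)
            return i;
    return -1;
}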
Write Strategies
On a hit:
- Write Through: write to the cache and to memory.
- Write Back: write just to the cache; write to memory only when a block is replaced. Requires a dirty bit.

On a miss:
- Write Allocate: allocate a cache line for the value to be written.
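A minimal sketch (not from the slides) contrasting the two hit policies; the line layout and 64-byte block size are assumptions for illustration.

#include <stdint.h>

typedef struct {
    int valid, dirty;          /* the dirty bit is only needed for write-back */
    uint64_t tag;
    unsigned char data[64];    /* assumed 64-byte block */
} wline;

/* Write-through hit: update the cached copy and memory immediately. */
void write_through_hit(wline *line, int off, unsigned char v, unsigned char *mem_block)
{
    line->data[off] = v;
    mem_block[off]  = v;
}

/* Write-back hit: update only the cached copy and mark it dirty;
   memory is brought up to date later, when the line is evicted. */
void write_back_hit(wline *line, int off, unsigned char v)
{
    line->data[off] = v;
    line->dirty = 1;
}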
Multi-Level Caches
Options: separate data and instruction caches, or a unified cache
[Figure: multi-level cache hierarchy. Processor registers (about 200 B, about 3 ns), separate L1 d-cache and L1 i-cache (8-64 KB, about 3 ns), a unified L2 cache, main memory, and disk (about 30 GB, about 8 ms, about $0.05/MB).]
Hit Time
Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
Typical numbers:
- 1 clock cycle for L1
- 3-8 clock cycles for L2
Miss Penalty
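As a rough illustration (the formula is standard, but the specific numbers here are assumed, not taken from the slides): average access time = hit time + miss rate x miss penalty. For example, a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss penalty give 1 + 0.05 x 50 = 3.5 cycles per access.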
/* Sum the elements of a, stepping through each row in order
   (stride-1 accesses: good spatial locality) */
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
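For contrast, a column-major traversal of the same M x N array (a sketch, not taken from the slides) steps through addresses N ints apart, so it gets little benefit from each cached block.

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    /* Column-major traversal: consecutive accesses are N ints apart,
       so the rest of each fetched block goes unused before eviction. */
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}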
The Memory Mountain
- Measured read throughput as a function of spatial and temporal locality
- A compact way to characterize memory-system performance
The mountain's main routine (data, MAXELEMS, MINBYTES, MAXBYTES, MAXSTRIDE, init_data, mhz, and run are defined elsewhere in the program):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int size;        /* Working set size (in bytes) */
    int stride;      /* Stride (in array elements)  */
    double Mhz;      /* Clock frequency             */

    init_data(data, MAXELEMS);   /* Initialize each element in data to 1 */
    Mhz = mhz(0);                /* Estimate the clock frequency */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}
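The main routine relies on helper functions that are not shown; a sketch (assumed, not reproduced from the slides) of the kind of read kernel that run() would time looks like this.

extern int data[];   /* the array traversed by main() above */

/* Sweep the first elems elements of data with the given stride;
   the volatile sink keeps the compiler from deleting the loop. */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;
}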
[Figure: the memory mountain. Read throughput (MB/s, 0 to 1000) plotted against stride (s1 to s15, in words) and working set size (2 KB to 8 MB), measured on a 550 MHz Pentium III Xeon with a 16 KB on-chip L1 d-cache, a 16 KB on-chip L1 i-cache, and a 512 KB off-chip unified L2 cache. Distinct regions of the mountain correspond to L1, L2, and main memory.]
[Figure: a slice through the memory mountain. Read throughput (MB/s, up to 800) versus working set size (1 KB to 8 MB).]
Matrix-Multiplication Example
Major cache effects to consider:
- Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking)
- Block size: exploit spatial locality

Description: multiply two N x N matrices with a triply nested loop (O(N^3) operations).
/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;                    /* Variable sum held in a register */
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Assumptions:
- Line size = 32 B (big enough for four 64-bit words)
- Matrix dimension (N) is very large, so approximate 1/N as 0.0

Analysis method: look at the access pattern of the inner loop.
Stepping through successive elements of size k bytes:
- If the block size (B) > k bytes, spatial locality is exploited
- Compulsory miss rate = k/B (one miss per block of B/k elements)
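Applying this with the assumptions above (the arithmetic is spelled out here, not on the slide): doubles are k = 8 bytes and the line size is B = 32 bytes, so a row-wise sweep misses 8/32 = 0.25 of the time; a column-wise sweep over a very large matrix lands in a different block on every access, a miss rate of 1.0; and a fixed element misses essentially never. These are the per-matrix rates used below.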
Matrix multiply (ijk and jik), inner loop: step across row i of A (i,*) row-wise, down column j of B column-wise, and accumulate into the fixed element C(i,j).
Misses per inner-loop iteration: A = 0.25, B = 1.0, C = 0.0.
[Figure: access pattern for the ijk and jik orderings: A row-wise, B column-wise, C fixed.]
Matrix multiply (kij and ikj), inner loop: A(i,k) stays fixed (held in a register as r), while row k of B (k,*) and row i of C (i,*) are swept row-wise.
Misses per inner-loop iteration: A = 0.0, B = 0.25, C = 0.25.
[Figure: access pattern for the kij and ikj orderings: A fixed, B row-wise, C row-wise.]
Matrix multiply (jki and kji), inner loop: B(k,j) stays fixed (held in a register as r), while column k of A and column j of C are swept column-wise.
Misses per inner-loop iteration: A = 1.0, B = 0.0, C = 1.0.
[Figure: access pattern for the jki and kji orderings: A column-wise, B fixed, C column-wise.]
Summary of the three loop-order pairs (misses per inner-loop iteration):

ijk (and jik): misses/iter = 1.25
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (and ikj): misses/iter = 0.5
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (and kji): misses/iter = 2.0
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
[Figure: matrix multiply performance. Cycles per inner-loop iteration (0 to 50) versus array size n (25 to 400).]
Block (in this context) does not mean cache block; instead, it means a sub-block within the matrix. Example: N = 8; sub-block size = 4.

    [A11 A12]   [B11 B12]   [C11 C12]
    [A21 A22] x [B21 B22] = [C21 C22]

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars:
    C11 = A11 B11 + A12 B21
    C21 = A21 B11 + A22 B21
Blocked matrix multiply (bijk), innermost loop pair (kk and jj are the offsets of the current sub-block):

for (i = 0; i < n; i++) {
    for (j = jj; j < min(jj+bsize, n); j++) {
        sum = 0.0;
        for (k = kk; k < min(kk+bsize, n); k++)
            sum += a[i][k] * b[k][j];
        c[i][j] += sum;
    }
}

[Figure: the innermost loop pair. Each row sliver of a is accessed bsize times, each bsize x bsize block of b is reused n times in succession, and successive elements of a row sliver of c are updated.]
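A sketch of the full blocked routine; everything outside the innermost loop pair above (the outer loops over kk and jj, the zeroing of c, the min helper, and the parameter types) is an assumption added for completeness rather than text from the slides.

#define min(x, y) ((x) < (y) ? (x) : (y))

void bijk(int n, int bsize, double a[n][n], double b[n][n], double c[n][n])
{
    int i, j, k, kk, jj;
    double sum;

    for (i = 0; i < n; i++)                  /* clear the result matrix */
        for (j = 0; j < n; j++)
            c[i][j] = 0.0;

    for (kk = 0; kk < n; kk += bsize) {      /* block offset along the k dimension */
        for (jj = 0; jj < n; jj += bsize) {  /* block offset along the j dimension */
            for (i = 0; i < n; i++) {
                for (j = jj; j < min(jj + bsize, n); j++) {
                    sum = 0.0;
                    for (k = kk; k < min(kk + bsize, n); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;
                }
            }
        }
    }
}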
[Figure: blocked matrix multiply performance. Cycles per iteration (0 to 60) versus array size (25 to 400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]
Concluding Observations
Programmer can optimize for cache performance