Lec 34


CS222: Cache Performance Improvement

Dr. A. Sahu
Dept of Comp. Sc. & Engg.
Indian Institute of Technology Guwahati
Outline
• Eleven Advanced Cache Performance Optimizations
– Prev: Reducing hit time & Increasing bandwidth
– Prev: Reducing miss penalty
– Reducing miss rate
– Reducing miss penalty * miss rate
Eleven Advanced Optimizations for Cache Performance
• Reducing hit time
• Reducing miss penalty
• Reducing miss rate
• Reducing miss penalty * miss rate

Ref: 5.2, Computer Architecture: A Quantitative Approach, Hennessy & Patterson, 4th Edition
PDF version available on course website (Intranet)
Reducing Hit Time
• Small and simple caches
• Pipelined cache access
• Trace caches
• Avoid time loss in address translation (Out of scope of this course: First read OS)
– Virtually indexed, physically tagged cache
• simple and effective approach
• possible only if cache is not too large
– Virtually addressed cache
• protection?, multiple processes?, aliasing?, I/O?
Reducing Miss Penalty
• Multi level caches
• Critical word first and early restart
• Giving priority to read misses over writes
• Merging write buffer
• Victim caches
Reducing Miss Rate
• Large Block Size
• Larger Cache
• Higher Associativity
• Way prediction and pseudo‐associative cache
• Compiler optimizations
Large Block Size
• Take benefit of spatial locality
• Reduces compulsory misses
• Too large block size ‐ misses increase
• Miss Penalty increases
Large Cache
• Reduces capacity misses
• Hit time increases
• Keep small L1 cache and large L2 cache
Higher Associativity
• Reduces conflict misses
• 8‐way is almost like fully associative
• Hit time increases: What to do? – Pseudo associativity
Way Prediction and Pseudo‐associative Cache
Way prediction: low miss rate of SA cache with hit time of DM cache
• Only one tag is compared initially
• Extra bits are kept for prediction
• Hit time in case of mis‐prediction is high
Pseudo‐assoc. or column‐assoc. cache: get advantage of SA cache in a DM cache
• Check sequentially in a pseudo‐set
• Fast hit and slow hit (see the sketch below)
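A software sketch of the pseudo‐associative (column‐associative) lookup described above. This is illustrative only, not from the slides: the structure names, the 256‐set size, and the flipped index bit are assumptions, and the full block address is stored instead of a tag to keep the model simple.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256                     /* direct mapped: one line per set */

struct line { bool valid; uint32_t block_addr; };
static struct line cache[NUM_SETS];

/* Returns true on a hit; *slow_hit is set when the block was found only in
   the pseudo-set (the second, slower probe). */
bool lookup(uint32_t block_addr, bool *slow_hit)
{
    uint32_t index = block_addr % NUM_SETS;

    *slow_hit = false;
    if (cache[index].valid && cache[index].block_addr == block_addr)
        return true;                              /* fast hit in primary set */

    uint32_t pseudo = index ^ (NUM_SETS >> 1);    /* flip top index bit */
    if (cache[pseudo].valid && cache[pseudo].block_addr == block_addr) {
        *slow_hit = true;                         /* slow hit in pseudo-set */
        return true;
    }
    return false;                                 /* miss */
}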
Compiler optimizations
Loop interchange
• Improve spatial locality by scanning arrays row‐wise (see the sketch below)
Blocking
• Improve temporal and spatial locality (a blocked sketch follows the matrix multiplication codes)
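A minimal sketch of loop interchange (illustrative; the array name and size are assumptions, not from the slides). With row‐major storage, putting the column index in the inner loop touches consecutive addresses, so each cache block is fully used before it is evicted.

#include <stddef.h>

#define N 1024
static double x[N][N];               /* C stores this row-major */

/* Before interchange: column-wise traversal; consecutive iterations are
   N*sizeof(double) bytes apart, so almost every access can miss. */
double sum_column_wise(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += x[i][j];
    return s;
}

/* After interchange: row-wise traversal uses every element of a cache
   block before moving on, exploiting spatial locality. */
double sum_row_wise(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += x[i][j];
    return s;
}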
Improving Locality
Matrix Multiplication example

[C]_{L×M} = [A]_{L×N} × [B]_{N×M}
Cache Organization for the example

• Cache line (or block) = 4 matrix elements.


• Matrices are stored row wise.
• Cache can't accommodate a full row/column.
– L, M and N are so large w.r.t. the cache size
– After an iteration along any of the three indices, when an element is accessed again, it results in a miss.
• Ignore misses due to conflict between matrices.
– As if there was a separate cache for each matrix.
Matrix Multiplication : Code I
for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      c[i][j] += A[i][k] * B[k][j];

            C        A        B
accesses    LM       LMN      LMN
misses      LM/4     LMN/4    LMN

Total misses = LM/4 + LMN/4 + LMN = LM(5N+1)/4
Matrix Multiplication : Code II
for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

            C        A        B
accesses    LMN      LN       LMN
misses      LMN/4    LN       LMN/4

Total misses = LMN/4 + LN + LMN/4 = LN(2M+4)/4
Matrix Multiplication : Code III
for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

            C        A        B
accesses    LMN      LN       LMN
misses      LMN/4    LN/4     LMN/4

Total misses = LMN/4 + LN/4 + LMN/4 = LN(2M+1)/4
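The Blocking optimization listed earlier can be sketched on the same multiplication. This is an illustrative version, not part of the slides; the tile size BS is an assumption and would be chosen so that one tile of each matrix fits in the cache.

#define BS 32   /* illustrative tile size */

/* Blocked (tiled) matrix multiplication: each BS-wide strip of B is reused
   for every row of A before being evicted, adding temporal locality on top
   of the spatial locality of Code III. */
void matmul_blocked(int L, int M, int N,
                    double c[L][M], double A[L][N], double B[N][M])
{
    for (int kk = 0; kk < N; kk += BS)
        for (int jj = 0; jj < M; jj += BS)
            for (int i = 0; i < L; i++)
                for (int k = kk; k < kk + BS && k < N; k++)
                    for (int j = jj; j < jj + BS && j < M; j++)
                        c[i][j] += A[i][k] * B[k][j];
}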
Reducing Miss Penalty * Miss Rate
• Non‐blocking cache
• Hardware prefetching
• Compiler controlled prefetching
Non‐blocking Cache

In OOO processor

• Hit under a miss
– complexity of cache controller increases
• Hit under multiple misses or miss under a miss
– memory should be able to handle multiple misses
Hardware Prefetching
• Prefetch items before they are requested
– both data and instructions
• What and when to prefetch?
– fetch two blocks on a miss (requested+next)
• Where to keep prefetched information?
– in cache
– in a separate buffer (most common case)
Prefetch Buffer/Stream Buffer
[Figure: the cache and a small prefetch buffer sit between the processor (above) and memory (below)]
Compiler Controlled Pre‐fetching
• Semantically invisible (no change in registers or cache contents)
• Makes sense if processor doesn't stall while prefetching (non‐blocking cache)
• Overhead of prefetch instruction should not exceed the benefit
SW Prefetch Example
• 8 KB direct mapped, write back data cache with 16 byte blocks
• a is 3 × 100, b is 101 × 3

for (i = 0; i < 3; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = b[j][0] * b[j+1][0];

each array element is 8 bytes


misses in array a = 3 * 100 /2 = 150
misses in array b = 101
total misses = 251
SW Prefetch Example – contd.
Suppose we need to prefetch 7 iterations in
advance
for (j = 0; j < 100; j++) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i++)
  for (j = 0; j < 100; j++) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

misses in first loop = 7 (for b[0..6][0]) + 4 (for a[0][0..6])
misses in second loop = 4 (for a[1][0..6]) + 4 (for a[2][0..6])
total misses = 7 + 4 + 4 + 4 = 19, total prefetches = 400
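For reference, the pseudo-call prefetch() above could be written with the __builtin_prefetch intrinsic of GCC/Clang (a toolchain assumption; the slides do not name one). The bounds checks are a small addition so the sketch never forms out-of-range addresses.

/* First prefetch loop, assuming a is double[3][100] and b is double[101][3].
   __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write;
   locality ranges from 0 (no reuse expected) to 3 (high reuse). */
void first_loop(double a[3][100], double b[101][3])
{
    for (int j = 0; j < 100; j++) {
        if (j + 7 < 101)
            __builtin_prefetch(&b[j + 7][0], 0, 1);   /* only read later */
        if (j + 7 < 100)
            __builtin_prefetch(&a[0][j + 7], 1, 1);   /* written later */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
}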
SW Prefetch Example – contd.
Performance improvement?
Assume no capacity and conflict misses,
prefetches overlap with each other and with misses
Original loop body: 7 cycles per iteration; prefetch loop bodies: 9 and 8 cycles
Miss penalty = 100 cycles

Original loop = 300*7 + 251*100 = 27,200 cycles


1st prefetch loop = 100*9 + 11*100 = 2,000 cycles
2nd prefetch loop = 200*8 + 8*100 = 2,400 cycles
Speedup = 27200/(2000+2400) = 6.2
