Lec 34


CS222: Cache Performance Improvement

Dr. A. Sahu
Dept of Comp. Sc. & Engg.
Indian Institute of Technology Guwahati
Outline
• Eleven Advanced Cache Performance Optimizations
– Prev: Reducing hit time & Increasing bandwidth
– Prev: Reducing miss penalty
– Reducing miss rate
– Reducing miss penalty * miss rate
Eleven Advanced Optimizations for Cache Performance
• Reducing hit time
• Reducing miss penalty
• Reducing miss rate
• Reducing miss penalty * miss rate

Ref: 5.2, Computer Architecture: A Quantitative Approach, Hennessy & Patterson, 4th Edition
PDF version available on course website (Intranet)
Reducing Hit Time
• Small and simple caches
• Pipelined cache access
• Trace caches
• Avoid time loss in address translation (Out of scope of this course: First read OS)
– Virtually indexed, physically tagged cache
• simple and effective approach
• possible only if cache is not too large
– Virtually addressed cache
• protection?, multiple processes?, aliasing?, I/O?
Reducing Miss Penalty
• Multi level caches
• Critical word first and early restart
• Giving priority to read misses over writes
• Merging write buffer
• Victim caches
Reducing Miss Rate
• Large Block Size
• Larger Cache
• Higher Associativity
• Way prediction and pseudo‐associative cache
• Compiler optimizations
Large Block Size
• Take benefit of spatial locality
• Reduces compulsory misses
• Too large block size ‐ misses increase
• Miss Penalty increases
Large Cache
• Reduces capacity misses
• Hit time increases
• Keep small L1 cache and large L2 cache
Higher Associativity
• Reduces conflict misses
• 8‐way is almost like fully associative
• Hit time increases: What to do? – Pseudo associativity
Way Prediction and Pseudo‐associative Cache
Way prediction: low miss rate of SA cache with hit time of DM cache
• Only one tag is compared initially
• Extra bits are kept for prediction
• Hit time in case of mis‐prediction is high
Pseudo‐assoc. or column‐assoc. cache: get advantage of SA cache in a DM cache
• Check sequentially in a pseudo‐set
• Fast hit and slow hit (see the sketch below)
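A software sketch of the pseudo‐associative (column‐associative) lookup described above. This is illustrative only, not from the slides: the structure names, the 256‐set size, and the flipped index bit are assumptions, and the full block address is stored instead of a tag to keep the model simple.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256                     /* direct mapped: one line per set */

struct line { bool valid; uint32_t block_addr; };
static struct line cache[NUM_SETS];

/* Returns true on a hit; *slow_hit is set when the block was found only in
   the pseudo-set (the second, slower probe). */
bool lookup(uint32_t block_addr, bool *slow_hit)
{
    uint32_t index = block_addr % NUM_SETS;

    *slow_hit = false;
    if (cache[index].valid && cache[index].block_addr == block_addr)
        return true;                              /* fast hit in primary set */

    uint32_t pseudo = index ^ (NUM_SETS >> 1);    /* flip top index bit */
    if (cache[pseudo].valid && cache[pseudo].block_addr == block_addr) {
        *slow_hit = true;                         /* slow hit in pseudo-set */
        return true;
    }
    return false;                                 /* miss */
}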
Compiler optimizations
Loop interchange
• Improve spatial locality by scanning arrays row‐wise (see the sketch below)
Blocking
• Improve temporal and spatial locality (a blocked sketch follows the matrix multiplication codes)
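A minimal sketch of loop interchange (illustrative; the array name and size are assumptions, not from the slides). With row‐major storage, putting the column index in the inner loop touches consecutive addresses, so each cache block is fully used before it is evicted.

#include <stddef.h>

#define N 1024
static double x[N][N];               /* C stores this row-major */

/* Before interchange: column-wise traversal; consecutive iterations are
   N*sizeof(double) bytes apart, so almost every access can miss. */
double sum_column_wise(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += x[i][j];
    return s;
}

/* After interchange: row-wise traversal uses every element of a cache
   block before moving on, exploiting spatial locality. */
double sum_row_wise(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += x[i][j];
    return s;
}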
Improving Locality
Matrix Multiplication example

[C]_{L×M} = [A]_{L×N} × [B]_{N×M}
Cache Organization for the example

• Cache line (or block) = 4 matrix elements.


• Matrices are stored row wise.
• Cache can't accommodate a full row/column.
– L, M and N are so large w.r.t. the cache size
– After an iteration along any of the three indices, when an element is accessed again, it results in a miss.
• Ignore misses due to conflict between matrices.
– As if there was a separate cache for each matrix.
Matrix Multiplication : Code I
for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      c[i][j] += A[i][k] * B[k][j];

            C        A        B
accesses    LM       LMN      LMN
misses      LM/4     LMN/4    LMN

Total misses = LM/4 + LMN/4 + LMN = LM(5N+1)/4
Matrix Multiplication : Code II
for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

            C        A        B
accesses    LMN      LN       LMN
misses      LMN/4    LN       LMN/4

Total misses = LMN/4 + LN + LMN/4 = LN(2M+4)/4
Matrix Multiplication : Code III
for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

            C        A        B
accesses    LMN      LN       LMN
misses      LMN/4    LN/4     LMN/4

Total misses = LMN/4 + LN/4 + LMN/4 = LN(2M+1)/4
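The Blocking optimization listed earlier can be sketched on the same multiplication. This is an illustrative version, not part of the slides; the tile size BS is an assumption and would be chosen so that one tile of each matrix fits in the cache.

#define BS 32   /* illustrative tile size */

/* Blocked (tiled) matrix multiplication: each BS-wide strip of B is reused
   for every row of A before being evicted, adding temporal locality on top
   of the spatial locality of Code III. */
void matmul_blocked(int L, int M, int N,
                    double c[L][M], double A[L][N], double B[N][M])
{
    for (int kk = 0; kk < N; kk += BS)
        for (int jj = 0; jj < M; jj += BS)
            for (int i = 0; i < L; i++)
                for (int k = kk; k < kk + BS && k < N; k++)
                    for (int j = jj; j < jj + BS && j < M; j++)
                        c[i][j] += A[i][k] * B[k][j];
}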
Reducing Miss Penalty * Miss Rate
• Non‐blocking cache
• Hardware prefetching
• Compiler controlled prefetching
Non‐blocking Cache

In OOO processor

• Hit under a miss
– complexity of cache controller increases
• Hit under multiple misses or miss under a miss
– memory should be able to handle multiple misses
Hardware Prefetching
• Prefetch items before they are requested
– both data and instructions
• What and when to prefetch?
– fetch two blocks on a miss (requested+next)
• Where to keep prefetched information?
– in cache
– in a separate buffer (most common case)
Prefetch Buffer/Stream Buffer
[Figure: the cache and a small prefetch buffer sit between the processor (above) and memory (below)]
Compiler Controlled Pre‐fetching
• Semantically invisible (no change in registers or cache contents)
• Makes sense if processor doesn't stall while prefetching (non‐blocking cache)
• Overhead of prefetch instruction should not exceed the benefit
SW Prefetch Example
• 8 KB direct mapped, write back data cache with 16 byte blocks
• a is 3 × 100, b is 101 × 3

for (i = 0; i < 3; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = b[j][0] * b[j+1][0];

each array element is 8 bytes


misses in array a = 3 * 100 /2 = 150
misses in array b = 101
total misses = 251
SW Prefetch Example – contd.
Suppose we need to prefetch 7 iterations in
advance
for (j = 0; j < 100; j++) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i++)
  for (j = 0; j < 100; j++) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

misses in first loop = 7 (for b[0..6][0]) + 4 (for a[0][0..6])
misses in second loop = 4 (for a[1][0..6]) + 4 (for a[2][0..6])
total misses = 7 + 4 + 4 + 4 = 19, total prefetches = 400
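For reference, the pseudo-call prefetch() above could be written with the __builtin_prefetch intrinsic of GCC/Clang (a toolchain assumption; the slides do not name one). The bounds checks are a small addition so the sketch never forms out-of-range addresses.

/* First prefetch loop, assuming a is double[3][100] and b is double[101][3].
   __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write;
   locality ranges from 0 (no reuse expected) to 3 (high reuse). */
void first_loop(double a[3][100], double b[101][3])
{
    for (int j = 0; j < 100; j++) {
        if (j + 7 < 101)
            __builtin_prefetch(&b[j + 7][0], 0, 1);   /* only read later */
        if (j + 7 < 100)
            __builtin_prefetch(&a[0][j + 7], 1, 1);   /* written later */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
}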
SW Prefetch Example – contd.
Performance improvement?
Assume no capacity and conflict misses,
prefetches overlap with each other and with misses
Original loop body: 7 cycles per iteration; prefetch loop bodies: 9 and 8 cycles
Miss penalty = 100 cycles

Original loop = 300*7 + 251*100 = 27,200 cycles


1st prefetch loop = 100*9 + 11*100 = 2,000 cycles
2nd prefetch loop = 200*8 + 8*100 = 2,400 cycles
Speedup = 27200/(2000+2400) = 6.2
