Lecture 5 Cache Optimization
Appendix B and Ch 2
[Figure: miss rate (0% to 20%) versus block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K. Larger blocks reduce compulsory misses but increase conflict misses.]
Pentium 4 Pre-fetching
Intel Core i7 supports hardware prefetching to both L1 and L2 caches
/*Before*/
for(j=0;j<100;j++)
  for(i=0;i<5000;i++)
    x[i][j]=2*x[i][j];
/*After*/
for(i=0;i<5000;i++)
  for(j=0;j<100;j++)
    x[i][j]=2*x[i][j];
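The interchange above changes only the order in which elements are visited, not the result. A minimal sketch of both orders follows, assuming an `int` array with the 5000 x 100 shape used on the slide; the function names and the `check_interchange` helper are hypothetical and exist only for illustration.

```c
#include <assert.h>

#define ROWS 5000
#define COLS 100

/* "Before" order: the inner loop walks down a column, so consecutive
   accesses are COLS elements apart in row-major memory. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* "After" order: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory and reuse each fetched cache block. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Hypothetical self-check: both orders must produce identical arrays. */
void check_interchange(void)
{
    static int a[ROWS][COLS];
    static int b[ROWS][COLS];

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = b[i][j] = i + j;

    scale_column_order(a);
    scale_row_order(b);

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            assert(a[i][j] == b[i][j]);
}
```

Any speed difference between the two functions depends on the cache configuration and on whether the compiler performs the interchange itself, so it should be measured rather than assumed.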
DAP Spr.‘98 ©UCB 11
EX. Block Matrix Algorithm
• Operate on submatrices (blocks) instead of entire
rows or columns.
• The block size is chosen so that the submatrices fit
in the cache.
/*Before*/
for(i=0;i<N;i++)
  for(j=0;j<N;j++){
    r=0;
    for(k=0;k<N;k++)
      r=r+y[i][k]*z[k][j];
    x[i][j] = r;}
/*After*/
/* x must be zero before the loops start; N is a multiple of B */
for(jj=0;jj<N;jj=jj+B) // among blocks
  for(kk=0;kk<N;kk=kk+B) // among blocks
    for(i=0;i<N;i++)
      for(j=jj;j<jj+B;j++){ // within a block
        r=0;
        for(k=kk;k<kk+B;k++) // within a block
          r=r+y[i][k]*z[k][j];
        x[i][j] = x[i][j] +r;
      }
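The blocked loop nest can be checked against the straightforward version. The following is a minimal sketch, assuming `double` matrices, a matrix dimension of 64, and a blocking factor of 16; these values and the function names are illustrative choices, not taken from the lecture. The sketch requires `N` to be a multiple of `B`, and it zeroes `x` first because the blocked version accumulates partial sums into it.

```c
#include <assert.h>

#define N 64   /* matrix dimension (assumed for this sketch) */
#define B 16   /* blocking factor (assumed); must divide N here */

/* Unblocked version: each x[i][j] is computed in one pass over k. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version: works on B-wide strips of y and B x B tiles of z so
   the data touched by the three inner loops can stay resident in cache. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    assert(N % B == 0);   /* the loop bounds below rely on this */

    /* x accumulates one partial sum per kk block, so start from zero. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0.0;

    for (int jj = 0; jj < N; jj += B)          /* among blocks */
        for (int kk = 0; kk < N; kk += B)      /* among blocks */
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {   /* within a block */
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++) /* within a block */
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

The blocked version adds the products in a different order, so with general floating-point inputs the two results may differ in the last bits; with small integer-valued inputs they match exactly.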
• L2 Equations
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 +
Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
• Definitions:
– Local miss rate: misses in this cache divided by the total number
of memory accesses to this cache (Miss Rate_L2)
– Global miss rate: misses in this cache divided by the total
number of memory accesses generated by the CPU
(for L2, Miss Rate_L1 × Miss Rate_L2)
Global Miss Rate is what matters
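The two definitions differ only in the denominator. A minimal sketch follows, assuming miss counts collected over some run; the `miss_counts` struct and the function names are hypothetical and introduced only for illustration.

```c
#include <assert.h>

/* Counts gathered over a run (hypothetical layout for illustration). */
struct miss_counts {
    unsigned long cpu_accesses;  /* memory references issued by the CPU */
    unsigned long l1_misses;     /* these are the accesses that reach L2 */
    unsigned long l2_misses;     /* these are the accesses that reach memory */
};

/* For L1 the local and global miss rates coincide. */
double l1_miss_rate(struct miss_counts c)
{
    assert(c.cpu_accesses > 0);
    return (double)c.l1_misses / (double)c.cpu_accesses;
}

/* Local L2 miss rate: L2 misses over the accesses L2 actually sees. */
double l2_local_miss_rate(struct miss_counts c)
{
    assert(c.l1_misses > 0);
    return (double)c.l2_misses / (double)c.l1_misses;
}

/* Global L2 miss rate: L2 misses over all CPU memory accesses. */
double l2_global_miss_rate(struct miss_counts c)
{
    assert(c.cpu_accesses > 0);
    return (double)c.l2_misses / (double)c.cpu_accesses;
}
```

The global L2 rate equals the L1 miss rate multiplied by the local L2 rate, which is why a small L2 can show a high local miss rate while the global rate stays low.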
Comparing Local and Global
• 32 KB 1st level cache; increasing 2nd level cache
• Local miss rate is for L2; it is very high for small L2 sizes
• Single cache miss rate is the rate if we had one cache of the
size shown on the x-axis
• Global miss rate is close to the single-level cache rate
provided L2 >> L1
[Figure: miss rates versus cache size (log scale)]
• The idea is to reduce miss
penalty without increasing the
miss rate
• L1 speed affects the CPU clock cycle, but L2 speed does
not; L2 only affects the miss penalty of the first-level
cache
AMAT Example
• For every 1000 memory references, assume 40
misses in L1 and 20 misses in L2.
Hit time is 1 cycle in L1 and 10 cycles in L2; miss
penalty from L2 to memory is 100 cycles; there are
1.5 memory references per instruction. What are the
AMAT and the average stall cycles per instruction?
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
– Miss Rate_L1 = 40/1000; local Miss Rate_L2 = 20/40
– AMAT = 1 + 40/1000 × (10 + 20/40 × 100) = 3.4 cycles
– AMAT without L2 = 1 + 40/1000 × 100 = 5 cycles => an
improvement of 1.6 cycles due to L2
• Average memory stalls per instruction = Misses per instruction_L1 ×
Hit Time_L2 + Misses per instruction_L2 × Miss Penalty_L2
– Average stall cycles per instruction = 1.5 × 40/1000 × 10 + 1.5 ×
20/1000 × 100 = 3.6 cycles
• Note: We have not distinguished reads from writes.
L2 is accessed only on an L1 miss, and there are no
separate I-cache and D-cache.
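The arithmetic in this example can be reproduced with two small functions. This is a minimal sketch that takes the slide's simplifications as given (reads and writes are treated alike, and L2 is accessed only on an L1 miss); the function and parameter names are hypothetical.

```c
#include <assert.h>

/* AMAT for a two-level hierarchy. All times are in clock cycles;
   local_miss_rate_l2 is L2 misses divided by L2 accesses. */
double amat_two_level(double hit_time_l1, double miss_rate_l1,
                      double hit_time_l2, double local_miss_rate_l2,
                      double miss_penalty_l2)
{
    assert(miss_rate_l1 >= 0.0 && miss_rate_l1 <= 1.0);
    assert(local_miss_rate_l2 >= 0.0 && local_miss_rate_l2 <= 1.0);

    /* An L1 miss costs an L2 lookup plus, on an L2 miss, a memory access. */
    double miss_penalty_l1 = hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2;
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1;
}

/* Average memory stall cycles per instruction.
   global_miss_rate_l2 is L2 misses divided by all CPU memory references. */
double stalls_per_instruction(double refs_per_instruction,
                              double miss_rate_l1, double hit_time_l2,
                              double global_miss_rate_l2,
                              double miss_penalty_l2)
{
    double l1_misses_per_instr = refs_per_instruction * miss_rate_l1;
    double l2_misses_per_instr = refs_per_instruction * global_miss_rate_l2;
    return l1_misses_per_instr * hit_time_l2
         + l2_misses_per_instr * miss_penalty_l2;
}
```

With the example's values, `amat_two_level(1, 0.04, 10, 0.5, 100)` evaluates to 3.4 cycles and `stalls_per_instruction(1.5, 0.04, 10, 0.02, 100)` to 3.6 cycles, up to floating-point rounding. The two results agree with each other: (3.4 - 1) × 1.5 = 3.6.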
Reducing Miss Penalty Summary
CPU time = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
• Four techniques
1. Read priority over write on miss
2. Early Restart and Critical Word First on miss
3. Write Buffer
4. Second Level Cache
• Can be applied recursively to multilevel caches
– Danger is that the time to DRAM grows with multiple
levels of cache memories
– First attempts at L2 caches can make things worse: an
access that misses in every level (such as a compulsory
miss) pays for each lookup, so the worst-case latency
increases
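The CPU time relation in this summary shows where the miss penalty enters overall performance. A minimal sketch of it follows; the function and parameter names are hypothetical, and any input values are the caller's assumptions rather than figures from the lecture.

```c
#include <assert.h>

/* CPU time = IC × (CPI_execution + memory accesses per instruction
   × miss rate × miss penalty) × clock cycle time.
   miss_penalty is in clock cycles; clock_cycle_time is in seconds. */
double cpu_time(double instruction_count, double cpi_execution,
                double mem_accesses_per_instr, double miss_rate,
                double miss_penalty, double clock_cycle_time)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);

    /* Extra cycles per instruction spent waiting on cache misses. */
    double memory_stall_cpi = mem_accesses_per_instr * miss_rate * miss_penalty;
    return instruction_count * (cpi_execution + memory_stall_cpi)
         * clock_cycle_time;
}
```

Because the miss penalty multiplies the miss rate and the access rate, each of the four techniques listed above lowers only the memory stall term and leaves the execution CPI unchanged.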