Improving Cache Performance: Average Memory Access Time (AMAT) = Hit Time + Miss Rate x Miss Penalty
Optimizations based on:
• Reducing Miss Rate:
• Structural: Cache size, Associativity, Block size, Compiler support
Miss Categories:
• Compulsory: Cold-start (first-reference) misses
• Equal to the miss rate of an infinite cache (no capacity or conflict misses)
• Characteristic of the workload: e.g., streaming access patterns (majority of misses are compulsory)
Replacement Algorithms:
Optimal off-line algorithm:
Belady Rule: Evict the cache block whose next reference is furthest in the future
Provides a lower bound on the number of capacity misses for a given cache size
Example (4-block cache, access sequence A B C D E C E A D B C D E A B): the optimal policy evicts B, then later A, taking only 2 capacity misses beyond the 5 compulsory ones.
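The Belady rule lends itself to direct simulation. The sketch below (function names are illustrative) models a 4-block fully associative cache and, on a miss with a full cache, evicts the block whose next reference lies furthest ahead:

```c
#include <assert.h>
#include <string.h>

#define WAYS 4

/* Position of the next reference of block c at or after pos,
 * or n (furthest possible) if c is never referenced again. */
static int next_use(const char *seq, int n, int pos, char c) {
    for (int i = pos; i < n; i++)
        if (seq[i] == c) return i;
    return n;
}

/* Count misses under Belady's optimal (off-line) replacement. */
int belady_misses(const char *seq) {
    int n = (int)strlen(seq);
    char cache[WAYS];
    int filled = 0, misses = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int w = 0; w < filled; w++)
            if (cache[w] == seq[i]) { hit = 1; break; }
        if (hit) continue;
        misses++;
        if (filled < WAYS) { cache[filled++] = seq[i]; continue; }
        /* Evict the block whose next reference is furthest in the future. */
        int victim = 0, far = -1;
        for (int w = 0; w < WAYS; w++) {
            int d = next_use(seq, n, i + 1, cache[w]);
            if (d > far) { far = d; victim = w; }
        }
        cache[victim] = seq[i];
    }
    return misses;
}
```

On the sequence from the example it reports 7 misses: 5 compulsory plus 2 capacity.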
Cache Replacement
Replacement Algorithms:
Least Recently Used (LRU): Evict the cache block that was last referenced furthest in the past
Cache size: 4 Blocks
Block Access Sequence: A B C D E C E A D B C D E A B
LRU
5 Compulsory Misses (A, B, C, D, E) plus 6 capacity misses, 11 in total: 4 additional misses compared with the optimal policy's 7, due to non-optimal replacement.
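True LRU on the same sequence can be checked with a small simulation (a sketch; a 4-block fully associative cache, names illustrative). Simulating it yields 11 misses, versus 7 for the optimal policy:

```c
#include <assert.h>
#include <string.h>

#define WAYS 4

/* Count misses under true LRU for a fully associative 4-block cache.
 * cache[0] holds the LRU block, cache[filled-1] the MRU block. */
int lru_misses(const char *seq) {
    int n = (int)strlen(seq);
    char cache[WAYS];
    int filled = 0, misses = 0;
    for (int i = 0; i < n; i++) {
        int pos = -1;
        for (int w = 0; w < filled; w++)
            if (cache[w] == seq[i]) { pos = w; break; }
        if (pos < 0) {                 /* miss */
            misses++;
            if (filled == WAYS) {      /* evict LRU: shift everything down */
                memmove(cache, cache + 1, WAYS - 1);
                filled--;
            }
            cache[filled++] = seq[i];
        } else {                       /* hit: move block to MRU position */
            char c = cache[pos];
            memmove(cache + pos, cache + pos + 1, (size_t)(filled - pos - 1));
            cache[filled - 1] = c;
        }
    }
    return misses;
}
```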
On every hit, LRU must read and update the ordering information; this overhead makes true LRU impractical for hardware-maintained caches.
LRU
• Approximate LRU (Some Intel processors)
Left/Right accessed last? One bit per internal node of a binary tree over the 8 ways (A, B, C, D, E, F, G, H) records whether the left or right (R) subtree was accessed more recently; the victim is found by walking from the root toward the less recently accessed side at each level.
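A minimal sketch of the tree scheme, assuming an 8-way set and one bit per internal node (the bit records which half was touched last; the victim search walks the opposite way):

```c
#include <assert.h>

/* Tree-based pseudo-LRU for an 8-way set: 7 bits in a binary tree.
 * bit[i] == 1 means the RIGHT subtree was accessed more recently.
 * Node 0 is the root; children of node i are 2i+1 and 2i+2. */
typedef struct { unsigned char bit[7]; } plru8;

/* On an access to `way` (0..7), record the path from root to it. */
void plru_touch(plru8 *t, int way) {
    int node = 0;
    for (int level = 2; level >= 0; level--) {
        int right = (way >> level) & 1;
        t->bit[node] = (unsigned char)right;
        node = 2 * node + 1 + right;
    }
}

/* Pick a victim by walking toward the less recently used half. */
int plru_victim(const plru8 *t) {
    int node = 0, way = 0;
    for (int level = 0; level < 3; level++) {
        int go_right = !t->bit[node];  /* opposite of last-accessed side */
        way = (way << 1) | go_right;
        node = 2 * node + 1 + go_right;
    }
    return way;
}
```

Only 7 bits per set are read and updated, instead of the full ordering true LRU needs.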
• Random Selection
Reducing Miss Rate
1. Larger cache size:
+ Reduces capacity misses
- Hit time may increase
- Cost increases
2. Increased Associativity:
+ Miss rate decreases (fewer conflict misses)
- Hit time increases; may increase clock cycle time
- Hardware cost increases
Miss rate with 8-way associative comparable to fully associative (empirical finding)
Example
Direct mapped cache: Hit time 1 cycle, Miss Penalty 25 cycles (low!), Miss rate = 0.08
8-way set associative: Clock cycle 1.5x, Miss rate = 0.07
Let T be clock cycle of direct mapped cache
AMAT (direct mapped) = (1 + 0.08 x 25) x T = 3.0T
AMAT (set associative): new clock period = 1.5 x T, so AMAT = 1.5T + 0.07 x Miss Penalty
Miss Penalty = ceiling(25T / 1.5T) x 1.5T = ceiling(25/1.5) x 1.5T = 17 x 1.5T = 25.5T (the 25-cycle penalty is rounded up to a whole number of the longer cycles)
AMAT = 1.5T + 0.07 x 25.5T = T(1.5 + 1.785) = 3.285T
(Increasing associativity hurts in this example!!!)
Reducing Miss Rate
Column (or pseudo) associative: built from a direct-mapped cache; on a miss in the primary location, a second location whose top index bit is flipped (0xxxx <-> 1xxxx) is probed. A hit in the second location is slower than a primary hit.
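One possible reading of the column-associative lookup, as a sketch: the rehash location flips the most significant index bit (the 0xxxx/1xxxx pair). Set count, simplified tag handling, and all names here are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define SETS 32
#define MSB  (SETS >> 1)   /* binary 10000: the index bit that is flipped */

typedef struct { int valid; uint32_t tag; } line_t;

/* Returns 1 on a hit (first or second probe), 0 on a miss.
 * Simplification: the tag is everything above the index bits. */
int col_assoc_lookup(line_t cache[SETS], uint32_t addr) {
    uint32_t idx = addr % SETS, tag = addr / SETS;
    if (cache[idx].valid && cache[idx].tag == tag) return 1;  /* fast hit */
    uint32_t alt = idx ^ MSB;                                 /* rehash   */
    if (cache[alt].valid && cache[alt].tag == tag) return 1;  /* slow hit */
    return 0;
}
```

A real design also swaps the two lines on a second-probe hit so the next access is fast; that step is omitted here.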
Way Prediction: predict which way of the set will hit and probe that way first; the remaining ways are checked only on a misprediction.
Reducing Miss Rate
5. Compiler Optimizations
• Instruction access
• Rearrange code (procedure, code block placements) to reduce conflict misses
• Align entry point of basic block with start of a cache block
a) Merging arrays: Replace parallel arrays with an array of structs (spatial locality)
b) Loop interchange: traverse arrays in the order they are laid out in memory (row-major: inner loop over the second index)
for (k = 0; k < m; k++)
    for (j = 0; j < n; j++)
        a[k][j] = 0;
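The array-merging idea in (a) can be sketched as follows (array, field, and function names are illustrative):

```c
#include <assert.h>

#define N 1000

/* Before: two parallel arrays -- key[i] and val[i] may live in
 * different cache blocks, far apart in memory. */
int key[N];
int val[N];

/* After: one array of structs -- key and val of element i are
 * adjacent, so one cache block serves both fields. */
struct merged { int key; int val; } rec[N];

int sum_merged(void) {
    int s = 0;
    for (int i = 0; i < N; i++)
        s += rec[i].key + rec[i].val;  /* single stream, better spatial locality */
    return s;
}
```

The transformation pays off when the fields are usually accessed together; fields accessed in separate passes are better left in separate arrays.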
Cache-insensitive matrix multiplication: O(n^3) cache misses for accessing matrix b elements
for (i=0; i < n; i++)
for (j=0; j < n; j++)
for (k=0; k < n; k++)
c[i][j] += a[i][k] * b[k][j];
Reducing Miss Rate
Compiler/Programmer Optimizations (contd …)
d) Blocking: Use block-oriented access to maximize both temporal and spatial locality
Reduces cache misses for accessing matrix b elements from O(n^3) to O(n^3/s), for s x s blocks chosen to fit in cache
for (i=0; i < n/s; i++)
for (j=0; j < n/s; j++)
for (k=0; k < n/s; k++)
C[i][j] = C[i][j] + A[i][k] * B[k][j];
Here A[i][k], B[k][j], and C[i][j] denote s x s blocks: each inner iteration performs a block matrix multiplication of A[i][k] with B[k][j] and adds the result into block C[i][j].
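A runnable version of the blocked loop nest, as a sketch with illustrative sizes (n = 4, block size s = 2): the three inner loops expand the block operations written above into element operations.

```c
#include <assert.h>
#include <string.h>

#define N 4   /* matrix dimension (illustrative) */
#define S 2   /* block size, assumed to divide N */

void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int ib = 0; ib < N; ib += S)
        for (int jb = 0; jb < N; jb += S)
            for (int kb = 0; kb < N; kb += S)
                /* multiply block A[ib][kb] by block B[kb][jb],
                 * accumulate into block C[ib][jb] */
                for (int i = ib; i < ib + S; i++)
                    for (int j = jb; j < jb + S; j++)
                        for (int k = kb; k < kb + S; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

While one s x s block of b is being multiplied it is reused s times, so it is likely still resident in the cache, which is where the miss reduction comes from.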