Improving Cache Performance

AMAT: Average Memory Access Time

AMAT = T_hit + Miss Rate x Miss Penalty

Optimizations based on:
• Reducing Miss Rate
  • Structural: Cache size, Associativity, Block size, Compiler support
• Reducing Miss Penalty
  • Structural: Multi-level caches, Critical word first / Early restart
  • Latency Hiding: Using concurrency to reduce miss rate or miss penalty
• Improving Hit Time

Cache Performance Models
Temporal Locality: Repeated access to the same word
Spatial Locality: Access to words in physical proximity to a recently accessed word

Miss Categories:
• Compulsory: Cold-start (first-reference) misses
  • Equal to the miss rate of an infinite cache
  • Characteristic of the workload: e.g., streaming workloads (majority of misses are compulsory)

• Capacity: Data set is larger than the cache
  • Increase the cache size to avoid thrashing
  • Measured with a fully associative abstraction

• Conflict: Cache organization causes a block to be discarded and later retrieved
  • Also called collision or interference misses


Cache Replacement
Replacement Algorithms:
Optimal off-line algorithm:
Belady Rule: Evict the cache block whose next reference is furthest in the future
Provides a lower bound on the number of capacity misses for a given cache size

Cache size: 4 blocks
Block access sequence: A B C D E C E A D B C D E A B

OPTIMAL (Belady): 5 compulsory misses (A, B, C, D, E)

Miss 5, access E, cache holds A B C D: evict B (its next reference is furthest away) -> A E C D
Miss 6, access B, cache holds A E C D: evict A -> B E C D
Miss 7, access A, cache holds B E C D: evict D (or E or C, none is referenced again) -> B E C A

Total: 5 compulsory misses + 2 capacity misses = 7 misses
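For concreteness, here is a small off-line simulator of the Belady policy on the trace above. It is a sketch, not from the slides; the helper names (next_use, CACHE_BLOCKS) are illustrative.

    #include <stdio.h>

    #define CACHE_BLOCKS 4
    #define EMPTY '\0'

    /* Position of the next use of 'blk' at or after 'start'; n+1 if never used again. */
    static int next_use(const char *trace, int n, int start, char blk) {
        for (int i = start; i < n; i++)
            if (trace[i] == blk) return i;
        return n + 1;
    }

    int main(void) {
        const char trace[] = "ABCDECEADBCDEAB";      /* sequence from the slides */
        int n = sizeof(trace) - 1;
        char cache[CACHE_BLOCKS] = {EMPTY, EMPTY, EMPTY, EMPTY};
        int misses = 0;

        for (int t = 0; t < n; t++) {
            char blk = trace[t];
            int hit = 0, free_slot = -1;
            for (int i = 0; i < CACHE_BLOCKS; i++) {
                if (cache[i] == blk) hit = 1;
                else if (cache[i] == EMPTY && free_slot < 0) free_slot = i;
            }
            if (hit) continue;
            misses++;
            if (free_slot >= 0) {                    /* compulsory miss into an empty slot */
                cache[free_slot] = blk;
                continue;
            }
            /* Belady: evict the resident block whose next reference is furthest away. */
            int victim = 0, furthest = -1;
            for (int i = 0; i < CACHE_BLOCKS; i++) {
                int d = next_use(trace, n, t + 1, cache[i]);
                if (d > furthest) { furthest = d; victim = i; }
            }
            cache[victim] = blk;
        }
        printf("Belady misses: %d\n", misses);       /* prints 7 for this trace */
        return 0;
    }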
Cache Replacement
Replacement Algorithms:
Least Recently Used (LRU): Evict the cache block that was last referenced furthest in the past

Cache size: 4 blocks
Block access sequence: A B C D E C E A D B C D E A B

LRU: 5 compulsory misses (A, B, C, D, E), plus additional misses due to non-optimal replacement

Miss 5, access E, cache holds A B C D: evict A (least recently used) -> E B C D
Miss 6, access A, cache holds E B C D: evict B -> E A C D
Miss 7, access B, cache holds E A C D: evict C -> E A B D
Miss 8, access C, cache holds E A B D: evict E -> C A B D
Miss 9, access E, cache holds C A B D: evict A -> C E B D
Miss 10, access A, cache holds C E B D: evict B -> C E A D
Miss 11, access B, cache holds C E A D: evict C -> B E A D

Total: 11 misses, i.e., 4 more than the optimal (Belady) policy on the same trace; the extra misses are due to non-optimal replacement
LRU
• Hard to implement efficiently

• Software: LRU Stack

Example: maintaining the LRU stack for the sequence A B C D E C E A D B C D E A B

    Access:        (A B C D)    E      C      E      A
    TOP            D            E      C      E      A
                   C            D      E      C      E
                   B            C      D      D      C
    LRU block      A            B      B      B      D
                   4 misses     miss   hit    hit    miss

On a hit the block is moved to the top of the stack; on a miss the bottom (LRU) block is evicted and the new block is pushed on top.
On a hit the ordering information must be read and rewritten, which is too costly for a hardware-maintained cache.
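A minimal software LRU-stack simulator matching the table above; this is a sketch, and the array-based stack and names are illustrative.

    #include <stdio.h>
    #include <string.h>

    #define CAPACITY 4

    int main(void) {
        const char trace[] = "ABCDECEADBCDEAB";
        char stack[CAPACITY];        /* stack[0] = most recently used, stack[size-1] = LRU */
        int size = 0, misses = 0;

        for (int t = 0; trace[t] != '\0'; t++) {
            char blk = trace[t];
            int pos = -1;
            for (int i = 0; i < size; i++)
                if (stack[i] == blk) { pos = i; break; }

            if (pos < 0) {                           /* miss */
                misses++;
                if (size == CAPACITY) size--;        /* drop the LRU block at the bottom */
            } else {                                 /* hit: remove the block from its position */
                memmove(&stack[pos], &stack[pos + 1], size - pos - 1);
                size--;
            }
            memmove(&stack[1], &stack[0], size);     /* push the block on top of the stack */
            stack[0] = blk;
            size++;
        }
        printf("LRU misses: %d\n", misses);          /* prints 11 for this trace */
        return 0;
    }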
LRU
• Approximate LRU (Some Intel processors)
  • Tree-based pseudo-LRU over the 8 blocks A B C D E F G H of a set:
    • Keep one bit per internal node of a binary tree over the ways (7 bits for 8 ways);
      each bit records whether its Left or Right subtree was accessed last
    • On an access: set the bits along the path to point toward the accessed block
    • On a miss: follow the path of the NOT-accessed-last subtrees to find the victim (sketched below)

• Random Selection
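Below is a minimal sketch of the tree pseudo-LRU bookkeeping for an 8-way set, assuming one bit per internal node; the names plru8_t, plru_touch, and plru_victim are illustrative and do not describe any particular processor.

    #include <stdio.h>

    /* tree[1..7] are the internal nodes of a complete binary tree stored heap-style:
       node i has children 2i and 2i+1; a bit of 0 means "left was accessed last",
       1 means "right was accessed last". Leaves 8..15 map to ways 0..7. */
    typedef struct { unsigned char tree[8]; } plru8_t;

    /* Update the tree on an access (hit or fill) to 'way' (0..7). */
    static void plru_touch(plru8_t *s, int way) {
        int node = 8 + way;                              /* leaf index */
        while (node > 1) {
            int parent = node / 2;
            s->tree[parent] = (node == 2 * parent + 1);  /* record which child was used */
            node = parent;
        }
    }

    /* Choose a victim on a miss: follow the NOT-accessed-last child at each level. */
    static int plru_victim(const plru8_t *s) {
        int node = 1;
        while (node < 8)
            node = 2 * node + (s->tree[node] ? 0 : 1);   /* go opposite to the last access */
        return node - 8;                                 /* leaf index back to a way number */
    }

    int main(void) {
        plru8_t set = {{0}};
        plru_touch(&set, 3);            /* pretend ways 3 and 5 were accessed recently */
        plru_touch(&set, 5);
        printf("victim way: %d\n", plru_victim(&set));
        return 0;
    }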
Reducing Miss Rate
1. Larger cache size:
   + Reduces capacity misses
   - Hit time may increase
   - Cost increases

2. Increased associativity:
   + Miss rate decreases (fewer conflict misses)
   - Hit time increases and may increase the clock cycle time
   - Hardware cost increases

Empirically, the miss rate of an 8-way set associative cache is comparable to fully associative.

Example
Direct mapped cache: Hit time 1 cycle, Miss Penalty 25 cycles (low!), Miss rate = 0.08
8-way set associative: Clock cycle 1.5x that of the direct mapped cache, Miss rate = 0.07
Let T be the clock cycle of the direct mapped cache.

AMAT (direct mapped) = (1 + 0.08 x 25) x T = 3.0T

AMAT (set associative): new clock period = 1.5T
Miss Penalty = ceiling(25T / 1.5T) x 1.5T = ceiling(25 / 1.5) x 1.5T = 17 x 1.5T = 25.5T
AMAT = 1.5T + 0.07 x 25.5T = T(1.5 + 1.785) = 3.285T
(Increasing associativity hurts in this example!)
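The same calculation in a few lines of C; the subtle step is rounding the memory penalty up to whole cycles of the slower clock. The variable names are illustrative.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double T = 1.0;                           /* clock period of the direct mapped cache */

        /* Direct mapped: hit 1 cycle, miss rate 8%, miss penalty 25 cycles. */
        double amat_dm = (1.0 + 0.08 * 25.0) * T;

        /* 8-way set associative: clock stretched to 1.5T, miss rate 7%.
           The 25T memory penalty must be rounded up to whole (longer) cycles. */
        double clk = 1.5 * T;
        double penalty = ceil(25.0 * T / clk) * clk;   /* 17 cycles of 1.5T = 25.5T */
        double amat_sa = clk + 0.07 * penalty;

        printf("direct mapped  : %.3f T\n", amat_dm);  /* 3.000 T */
        printf("8-way set assoc: %.3f T\n", amat_sa);  /* 3.285 T */
        return 0;
    }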
Reducing Miss Rate

3. Block Size (B):
• Miss rate first decreases and then increases with increasing block size
   + a) Compulsory miss rate decreases due to better use of spatial locality
   - b) Capacity and conflict misses increase because fewer distinct blocks fit in the cache
• Miss penalty increases with increasing block size
   - c) Wasted memory access time: the longer transfer does not always provide a gain
   Do (a) and (c) balance each other?
   + d) Amortized memory access time per byte decreases (burst-mode memory)
• Tag overhead decreases

Low latency, low bandwidth memory: favors smaller block sizes
High latency, high bandwidth memory: favors larger block sizes

Reducing Miss Rate

Block Size B (contd):

Low latency, low bandwidth memory: favors smaller block sizes
High latency, high bandwidth memory: favors larger block sizes

Example:
Case 1: Miss ratio of 5% with B = 8 bytes; Case 2: Miss ratio of 4% with B = 16 bytes.
Burst-mode memory: latency of 8 cycles, transfer rate 2 bytes/cycle.
Cache hit time 1 cycle.

AMAT = Hit time + Miss Rate x Miss Penalty, where Miss Penalty = latency + B / transfer rate
Case 1: AMAT = 1 + 5% x (8 + 8/2) = 1.6 cycles
Case 2: AMAT = 1 + 4% x (8 + 16/2) = 1.64 cycles

Suppose the memory latency were 16 cycles instead: this favors the larger block size.
Case 1: AMAT = 1 + 5% x (16 + 8/2) = 2.0 cycles
Case 2: AMAT = 1 + 4% x (16 + 16/2) = 1.96 cycles
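The burst-mode arithmetic above can be checked directly; the helper name amat() and its parameters are illustrative.

    #include <stdio.h>

    /* Miss penalty model for burst-mode memory: fixed latency plus transfer time. */
    static double amat(double hit_time, double miss_rate,
                       double mem_latency, double block_bytes, double bytes_per_cycle) {
        double miss_penalty = mem_latency + block_bytes / bytes_per_cycle;
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Numbers from the example: B = 8 with 5% misses vs. B = 16 with 4% misses. */
        printf("latency  8: B=8 -> %.2f, B=16 -> %.2f cycles\n",
               amat(1, 0.05, 8, 8, 2), amat(1, 0.04, 8, 16, 2));
        printf("latency 16: B=8 -> %.2f, B=16 -> %.2f cycles\n",
               amat(1, 0.05, 16, 8, 2), amat(1, 0.04, 16, 16, 2));
        return 0;
    }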
Reducing Miss Rate
4. Pseudo-associative caches
   + Maintain the hit speed of a direct mapped cache
   + Reduce conflict misses

Column (or pseudo) associative:
• On a miss: check one more location in the direct mapped cache
• Like having a fixed way prediction

Way Prediction: Predict which block in the set will be read on the next access.
• If the tag matches: 1-cycle hit
• If the prediction fails: do the complete selection on subsequent cycles
   + Power savings potential
   - Poor prediction increases hit time

Column (or pseudo) associative

On a miss: check one more location in the direct mapped cache
Like having a fixed way prediction

[Figure: direct mapped cache drawn as an upper half (indices 0xxxx) and a lower half (indices 1xxxx); the alternate cache location for a block is the entry whose index differs only in the most significant index bit]
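A sketch of the column-associative lookup path, assuming (as in the figure) that the alternate location is found by flipping the most significant index bit; the names pseudo_assoc_lookup, line_t, and the 10-bit index are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define INDEX_BITS 10
    #define NUM_SETS   (1u << INDEX_BITS)

    typedef struct {
        bool     valid;
        uint32_t tag;
        /* data payload omitted */
    } line_t;

    static line_t cache[NUM_SETS];

    /* Returns 0 on a fast (primary) hit, 1 on a slower hit in the alternate
       location, -1 on a miss. */
    int pseudo_assoc_lookup(uint32_t addr_index, uint32_t addr_tag) {
        uint32_t primary = addr_index & (NUM_SETS - 1);
        if (cache[primary].valid && cache[primary].tag == addr_tag)
            return 0;                                    /* hit in the primary location */

        uint32_t alternate = primary ^ (NUM_SETS >> 1);  /* flip the MSB of the index */
        if (cache[alternate].valid && cache[alternate].tag == addr_tag)
            return 1;                                    /* hit, but with extra latency */

        return -1;                                       /* miss: go to the next level */
    }

A real design would typically also swap the two lines after a slow hit so that the next access to the block is found on the fast path.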
Way Prediction

Predict which block in the set will be read on the next access.
• If the tag matches: 1-cycle hit
• If the prediction fails: do the complete selection on subsequent cycles

[Figure: 2-way set associative cache; only the predicted way's tag (one of the two ways, 0xxxx or 1xxxx) is compared on the first cycle]
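A sketch of way prediction for a 2-way set associative cache with one prediction bit per set; all names are illustrative, and the returned cycle counts are only a simple model.

    #include <stdint.h>
    #include <stdbool.h>

    #define INDEX_BITS 9
    #define NUM_SETS   (1u << INDEX_BITS)
    #define WAYS       2

    typedef struct {
        bool     valid;
        uint32_t tag;
    } line_t;

    static line_t  cache[NUM_SETS][WAYS];
    static uint8_t predicted_way[NUM_SETS];   /* one prediction bit per set */

    /* Returns the cycles the access takes in this simplified model:
       1 if the predicted way hits, 2 if the other way hits, 0 to signal a miss. */
    int way_predicted_access(uint32_t set, uint32_t tag) {
        set &= NUM_SETS - 1;
        uint8_t guess = predicted_way[set];

        /* First cycle: compare only the predicted way's tag. */
        if (cache[set][guess].valid && cache[set][guess].tag == tag)
            return 1;

        /* Second cycle: check the remaining way and update the prediction. */
        uint8_t other = guess ^ 1;
        if (cache[set][other].valid && cache[set][other].tag == tag) {
            predicted_way[set] = other;
            return 2;
        }
        return 0;   /* miss: fill logic (not shown) would pick a victim and update the predictor */
    }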
Reducing Miss Rate
5. Compiler Optimizations

• Instruction access
  • Rearrange code (procedure and code block placement) to reduce conflict misses
  • Align the entry point of a basic block with the start of a cache block

• Data access: Improve spatial/temporal locality in arrays

a) Merging arrays: Replace parallel arrays with an array of structs (spatial locality); a fuller, self-contained sketch appears after this list

   update(j): { *name[j] = …; id[j] = …; age[j] = …; salary[j] = …; }

   update(j): { *(person[j].name) = …; person[j].id = …; person[j].age = …; person[j].salary = …; }

   When might separate arrays be better?


b) Loop Fusion: Combine loops that use the same data (temporal locality)

   Before:
   for (j=0; j < n; j++) x[j] = y[2*j];
   for (j=0; j < n; j++) sum += x[j];

   After fusion:
   for (j=0; j < n; j++) {
       x[j] = y[2*j];
       sum += x[j];
   }

   When might separate loops be better?
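Returning to transformation (a), merging arrays: here is a hedged, self-contained sketch of parallel arrays versus the merged array of structs. The record fields, sizes, and assigned values are illustrative, not from the slides.

    #define N 1000

    /* Parallel arrays: the fields of one logical record are far apart in memory,
       so updating record j touches four different cache blocks. */
    static char *name[N];
    static int   id[N], age[N], salary[N];

    void update_parallel(int j) {
        name[j]   = "x";
        id[j]     = 1;
        age[j]    = 30;
        salary[j] = 100;
    }

    /* Merged array of structs: the fields of record j are contiguous,
       so one cache block typically covers the whole update. */
    struct person {
        char *name;
        int   id, age, salary;
    };
    static struct person people[N];

    void update_merged(int j) {
        people[j].name   = "x";
        people[j].id     = 1;
        people[j].age    = 30;
        people[j].salary = 100;
    }

Separate arrays can still be better when a loop touches only one field across all records (e.g., summing every salary), since the merged layout then drags the unused fields into the cache.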


Reducing Miss Rate
Compiler Optimizations (contd …)

• Data access: Improve spatial/temporal locality in arrays

c) Loop interchange: Convert column-major matrix access to row-major access (spatial locality)

   Assume an m x n array a[][] stored in row-major order, element size w bytes, cache block size B bytes,
   so each block holds B/w consecutive elements of a row.

   Column-major access (before interchange): could miss on each access of a[][]
   for (j=0; j < n; j++)
       for (k=0; k < m; k++)
           a[k][j] = 0;
   Misses: up to mn

   Row-major access (after interchange): only compulsory misses, 1 per block
   for (k=0; k < m; k++)
       for (j=0; j < n; j++)
           a[k][j] = 0;
   Misses: mn / (B/w) = mnw/B
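To see the effect, here is a small hedged benchmark of the two traversal orders; the array size and the use of clock() are arbitrary choices for illustration, and the program should be compiled without aggressive optimization so the stores are not elided.

    #include <stdio.h>
    #include <time.h>

    #define M 2048
    #define N 2048

    static double a[M][N];   /* row-major in C */

    int main(void) {
        clock_t t0 = clock();
        for (int j = 0; j < N; j++)          /* column-major traversal: strided accesses */
            for (int k = 0; k < M; k++)
                a[k][j] = 0.0;
        clock_t t1 = clock();
        for (int k = 0; k < M; k++)          /* row-major traversal: sequential accesses */
            for (int j = 0; j < N; j++)
                a[k][j] = 0.0;
        clock_t t2 = clock();

        printf("column-major: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("row-major:    %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }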
Reducing Miss Rate
Compiler/Programmer Optimizations (contd …)
d) Blocking: Use block-oriented access to maximize both temporal and spatial locality

Cache-insensitive matrix multiplication: O(n^3) cache misses for accessing the elements of matrix b
   for (i=0; i < n; i++)
       for (j=0; j < n; j++)
           for (k=0; k < n; k++)
               c[i][j] += a[i][k] * b[k][j];

(Each iteration of the k loop walks down a column of b, so for large n every access to b[k][j] can miss.)
Reducing Miss Rate
Compiler/Programmer Optimizations (contd …)
d) Blocking: Use block-oriented access to maximize both temporal and spatial locality

Work on s x s sub-blocks so that the blocks of A, B, and C currently in use fit in the cache:
   for (i=0; i < n/s; i++)
       for (j=0; j < n/s; j++)
           for (k=0; k < n/s; k++)
               C[i][j] = C[i][j] + A[i][k] * B[k][j];
where C[i][j], A[i][k], B[k][j] denote s x s sub-matrices, "+" is block matrix addition, and "*" is block
matrix multiplication: each iteration of the k loop multiplies block A[i][k] by block B[k][j] and
accumulates one update into block C[i][j]. (A concrete scalar-level version is sketched below.)
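A scalar-level sketch of the blocked multiply, assuming the dimension is a multiple of the tile size; the tile size S and the i-k-j ordering inside the tile are illustrative choices.

    #include <stdio.h>

    #define N 512      /* matrix dimension; illustrative */
    #define S 64       /* tile size; chosen so the three active tiles fit in the cache */

    static double a[N][N], b[N][N], c[N][N];

    /* Blocked (tiled) matrix multiplication: the three innermost loops multiply the
       S x S tile of a at (i0,k0) by the tile of b at (k0,j0) and accumulate into c. */
    void matmul_blocked(void) {
        for (int i0 = 0; i0 < N; i0 += S)
            for (int j0 = 0; j0 < N; j0 += S)
                for (int k0 = 0; k0 < N; k0 += S)
                    for (int i = i0; i < i0 + S; i++)
                        for (int k = k0; k < k0 + S; k++) {
                            double aik = a[i][k];            /* reused across the j loop */
                            for (int j = j0; j < j0 + S; j++)
                                c[i][j] += aik * b[k][j];
                        }
    }

    int main(void) {
        /* tiny smoke test: identity * identity should give the identity */
        for (int i = 0; i < N; i++) { a[i][i] = 1.0; b[i][i] = 1.0; }
        matmul_blocked();
        printf("c[10][10] = %.1f  c[10][11] = %.1f\n", c[10][10], c[10][11]);  /* 1.0 and 0.0 */
        return 0;
    }

The innermost j loop streams along rows of both c and b, so spatial locality is preserved inside each tile while the tiles themselves provide temporal reuse.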
