Cache Performance
ECE 565
Performance Optimization & Parallelism
Duke University, Fall 2024
Motivation
• Memory Wall
– CPU speed and memory speed have grown at disparate rates
• CPU clock frequencies are much higher than memory frequencies
• Memory access takes many CPU cycles
– Hundreds, in fact!
– The latency of a load from main memory is typically in the 60-80ns range
• Cache hierarchy
– Caches are an integral part of current processor designs
• To reduce the impact of long memory latencies
– Cache hierarchies are often multiple levels today
• L1, L2, L3, sometimes L4
• Levels are larger and slower further down the hierarchy
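For example, the loop below sums an array with unit-stride accesses, so after each miss the next several elements come from the same cache block: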
int A[N];
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += A[i];   /* unit-stride access: one miss per cache block */
}
• Miss Ratio
– Ratio of cache misses to total cache references
– Typically less than 10% for an L1 cache, less than 1% for an L2 cache
• Hit Time
– Time to deliver a line in the cache to the processor
– 2-3 CPU cycles for L1, 15-20 cycles for L2, ~40 cycles for L3
– 60-80ns for main memory (hundreds of cycles)
– Related concept is “load-to-use” time
• Number of CPU cycles from execution of a load instruction until execution of an instruction that depends on the load value
• Miss Penalty
– Time required to access a line from the next level of the hierarchy
• Average access time = hit time + (miss ratio * miss penalty)
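As a worked sketch of this formula applied level by level, assuming illustrative latencies and miss ratios drawn from the ranges above (not measurements of any real processor):

#include <stdio.h>

int main(void) {
    double l1_hit  = 3.0;    /* cycles: L1 hit time (2-3 above)      */
    double l2_hit  = 18.0;   /* cycles: L2 hit time (15-20 above)    */
    double mem     = 250.0;  /* cycles: main memory (hundreds above) */
    double l1_miss = 0.05;   /* L1 miss ratio (< 10% typical)        */
    double l2_miss = 0.01;   /* L2 miss ratio (< 1% typical)         */

    /* The L1 miss penalty is itself the average access time of L2 */
    double l2_avg = l2_hit + l2_miss * mem;
    double l1_avg = l1_hit + l1_miss * l2_avg;
    printf("average access time = %.2f cycles\n", l1_avg);
    return 0;
}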
Cache Friendly Code
[Figure: load-to-use latency in CPU cycles (y-axis, 0 to 30) vs. working-set size in bytes (x-axis, 16384 to 2097152); latency plateaus mark the L1, L2, and L3 regions.]
[Figure: visit order for the elements of a 4×4 array; row-major traversal visits 1, 2, 3, 4 across each row, while column-major traversal visits 1, 5, 9, 13 down each column.]
Matrix Multiplication
• 3 Loops – i, j, k
– 6 ways to order the loops and multiply the matrices
• O(N³) total operations
– Each element of A and B is read N times
– N values are summed for each output element of C
[Figure: index pattern for C = A × B; A is indexed (i, k), B is indexed (k, j), C is indexed (i, j).]
• i-j-k
– Memory accesses for each inner loop iteration
• 2 loads: element A[i][k] and element B[k][j]
– A[i][k] access will be a cache miss at a rate of 8/64 per iteration (element size / block size: one miss per cache block)
– B[k][j] access strides down a column, so it will be a cache miss every iteration
• j-i-k cache miss behavior same as i-j-k
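A minimal C sketch of the i-j-k ordering (assuming N×N matrices of doubles in row-major storage; the function name is illustrative):

/* Inner k loop: A[i][k] walks a row (stride 1, good spatial locality),
 * B[k][j] walks a column (stride N, a miss nearly every iteration).  */
void matmul_ijk(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;          /* C[i][j] kept in a register */
            for (int k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}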
Matrix Multiplication
• k-i-j
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element B[k][j]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss at a rate of 8/64 per iteration (stride-1 row access)
– B[k][j] access will be a cache miss at a rate of 8/64 per iteration (stride-1 row access)
• i-k-j cache miss behavior same as k-i-j
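The same sketch reordered to k-i-j (assuming C is zero-initialized before the call):

/* Inner j loop: A[i][k] is loop-invariant and stays in a register;
 * C[i][j] and B[k][j] both walk rows (stride 1), so each misses only
 * once per cache block.                                              */
void matmul_kij(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double a = A[i][k];        /* good temporal locality */
            for (int j = 0; j < n; j++)
                C[i][j] += a * B[k][j];
        }
}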
Matrix Multiplication
• j-k-i
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element A[i][k]; 1 store: element C[i][j]
– C[i][j] access strides down a column, so it will be a cache miss every iteration
– A[i][k] access strides down a column, so it will be a cache miss every iteration
• k-j-i cache miss behavior same as j-k-i
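And the j-k-i ordering, the worst case (again assuming C is zero-initialized):

/* Inner i loop: B[k][j] is loop-invariant, but C[i][j] and A[i][k]
 * both walk columns (stride N), so each misses nearly every iteration. */
void matmul_jki(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k][j];        /* good temporal locality */
            for (int i = 0; i < n; i++)
                C[i][j] += A[i][k] * b;
        }
}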
Matrix Multiplication Summary
• k is innermost loop (i-j-k, j-i-k)
– A = good spatial locality
– C = good temporal locality
– Misses per iteration: 1 + (element sz / block sz)
• i is innermost loop (j-k-i, k-j-i)
– B = good temporal locality
– Misses per iteration: 2
• j is innermost loop (k-i-j, i-k-j)
– B, C = good spatial locality
– A = good temporal locality
– Misses per iteration: 2 * (element sz / block sz)
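Plugging in the sizes used above (8-byte elements, 64-byte blocks): k innermost costs 1 + 8/64 ≈ 1.13 misses per iteration, i innermost costs 2, and j innermost costs 2 * 8/64 = 0.25, so the j-innermost orderings (k-i-j and i-k-j) are the most cache friendly.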
[Figure: virtual address spaces of Process 0 through Process N, each mapped onto shared physical memory.]
• Page hit
– Memory reference to an address whose page is resident in physical memory
• Page miss (page fault)
– Reference to an address that is not in physical memory
– Misses are expensive
• Access to disk
• Software (the operating system) is involved in handling the fault
What is Stored Where in Physical Memory?
[Figure: processor chip with CPU, MMU, L1 I$, L1 D$, and L2 $; the MMU translates virtual addresses to physical addresses, which access physical memory.]
• TLB Reach
– Amount of memory whose translations the TLB can hold
– Should cover the working set size of a process
– TLB reach = (# TLB entries) * (page size)
• For example
– 64 TLB entries in the L1 DTLB * 64KB pages = 4MB reach
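– For comparison, with standard 4KB pages the same 64 entries reach only 64 * 4KB = 256KB, which is why larger pages help when the working set is big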