
Coding for Cache Optimization

ECE 565
Performance Optimization & Parallelism
Duke University, Fall 2024
Motivation

• Memory Wall
– CPU speed and memory speed have grown at disparate rates
• CPU frequencies are much faster than memory frequencies
• Memory access takes many CPU cycles
– Hundreds, in fact!
– The latency of a load from memory will be in the 60-80ns range

• Cache hierarchy
– Caches are an integral part of current processor designs
• To reduce the impact of long memory latencies
– Cache hierarchies are often multiple levels today
• L1, L2, L3, sometimes L4
• Levels are larger and slower further down the hierarchy

ECE 565 – Fall 2022 2


Motivation (2)

• Cache hierarchy works on principle of locality


– Temporal locality – recently referenced memory addresses are
likely to be referenced again soon
– Spatial locality – memory addresses near recently referenced
memory addresses are likely to be referenced soon
• Locality allows a processor to retrieve a large portion of
memory references from the cache hierarchy
• Thus the programs we write should exhibit good locality!
– Good news – this is typically true of well-written programs

ECE 565 – Fall 2022 3


Locality Example

• Similar to the loop scenarios we’ve discussed –

int A[N];
int i, sum = 0;
for (i = 0; i < N; i++) {
  sum = sum + A[i];
}

• Data locality for sum and elements of array A[]


– Temporal locality for sum, spatial locality for A[]
• Code locality for program instructions
– Loops create significant inherent code temporal locality
– Sequential streams of instructions create spatial locality

ECE 565 – Fall 2022 4


Single Slide Cache Reminder

(Diagram: Core → L1 $ → L2 Cache → L3 Cache → Memory; cache blocks, e.g. 64B, move between the levels.)

uint64_t A[N];
for (int i = 0; i < N; i++) {
  A[i] = seed;
}

• What happens when the CPU loads A[0] for the first time?
– Assume 32B cache blocks
– Fetch the block holding A[0]–A[3] (4 x 8B values)
• What happens if the CPU next loads A[5]?
– Fetch A[4]–A[7]; cache blocks are aligned

ECE 565 – Fall 2022 5
Important Cache Performance Metrics

• Miss Ratio
– Ratio of cache misses to total cache references
– Typically less than 10% for L1 cache, < 1% for an L2 cache
• Hit Time
– Time to deliver a line in the cache to the processor
– 2-3 CPU cycles for L1, 15-20 cycles for L2, ~40 cycles for L3
– 60-80ns for main memory (hundreds of cycles)
– Related concept is “load-to-use” time
• # of CPU cycles from the execution of a load instruction until
execution of an instruction that depends on the load value

• Miss Penalty
– Time required to access a line from the next level of the hierarchy
• Average access time = hit time + (miss rate * miss penalty)
ECE 565 – Fall 2022 6
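As a small illustration of the average-access-time formula above, here is a minimal sketch; the hit times and (local) miss ratios are assumed example values, not measurements of any particular machine:

#include <stdio.h>

/* Average memory access time (AMAT), applied recursively per level:
 * AMAT = hit_time + miss_ratio * (time to service the miss one level down).
 * All numbers below are illustrative assumptions. */
int main(void) {
  double l1_hit = 3.0,  l1_miss = 0.05;  /* ~3 cycles,  5% local miss ratio  */
  double l2_hit = 18.0, l2_miss = 0.20;  /* ~18 cycles, 20% local miss ratio */
  double l3_hit = 40.0, l3_miss = 0.30;  /* ~40 cycles, 30% local miss ratio */
  double mem    = 200.0;                 /* ~200 cycles to main memory       */

  double amat = l1_hit + l1_miss * (l2_hit + l2_miss * (l3_hit + l3_miss * mem));
  printf("average access time: %.2f cycles\n", amat);
  return 0;
}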
Cache Friendly Code

• Strongly related to the benefits we discussed for certain loop transformations
• Examples:
– Cold cache, 4-byte words, 4-word cache blocks

// Version 1: i outer, j inner (row-major traversal)
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum += a[i][j];
  }
}
Miss rate = 1/4 = 25%

// Version 2: j outer, i inner (column-major traversal)
for (j = 0; j < N; j++) {
  for (i = 0; i < N; i++) {
    sum += a[i][j];
  }
}
Miss rate = 100%

ECE 565 – Fall 2022 7
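A runnable version of this comparison, as a sketch only (the array size, element type, and timing harness are assumptions, not part of the slide):

#include <stdio.h>
#include <time.h>

/* Sketch: time a full sum over a 2D array in row-major order vs. column-major
 * order. Row-major order walks memory sequentially and reuses each cache
 * block; column-major order jumps a full row (N * sizeof(int) bytes) between
 * accesses. */
#define N 4096

static int a[N][N];   /* ~64 MB, larger than typical on-chip caches */

static double sweep(int row_major) {
  struct timespec t0, t1;
  long sum = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  if (row_major) {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        sum += a[i][j];
  } else {
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        sum += a[i][j];
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  if (sum == -1) printf("unreachable\n");   /* keep the loops from being optimized away */
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
  for (int i = 0; i < N; i++)               /* initialize so pages are mapped */
    for (int j = 0; j < N; j++)
      a[i][j] = i + j;

  printf("row-major:    %.3f s\n", sweep(1));
  printf("column-major: %.3f s\n", sweep(0));
  return 0;
}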


Reverse-engineering a Cache

• Assume you have a machine


• You do not know its
– Cache sizes (or number of levels of cache)
– Cache block sizes
– Cache associativity
– Latencies
– Data bandwidth
• We can find these out through test programs!
– Write targeted code
– Measure performance characteristics
– Analyze the measurements

ECE 565 – Fall 2022 8


In-Class Exercise

• Code a targeted test program to determine
– # of caches in our machine
– Size of each cache
– Latency of each cache
• Assumptions
– LRU replacement policy in use for each cache
– Cache block size is known
– Each level of the cache hierarchy has a different access latency

(Diagram: P → L1 $ → L2 $ → L3 $ → Main Memory)

ECE 565 – Fall 2022 9


Test Code Summary
• (See links to code files on the class schedule page)
• Repeatedly access elements of a data set of some size
– E.g. an array
– Each memory access should depend on value from prior access
• E.g. pointer chasing
• This exposes memory latency of each access
• Record execution time for this set of memory accesses
– Calculate average latency from
• measured time
• known # of accesses
• When data set size grows larger than the size of a cache level –
– We will see a step in the measured average access latency
– Latency will stay constant while the data set size fits within a cache level
• Loop of repeated memory accesses can be unrolled
– To reduce the interference from loop management instructions
• Access data set w/ fixed stride so each access touches a new cache block
• Can randomly cycle through data set elements to defeat prefetch
– Prefetch could disrupt measurement and blur the transition between cache levels

ECE 565 – Fall 2022 10
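Below is a minimal pointer-chasing sketch in the spirit of the test code described above; it is not the course's posted code, and the data-set sizes and access count are assumed values:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Each array element stores the index of the next element to visit, so every
 * load depends on the previous one and the full access latency is exposed.
 * Visiting the elements as one random cycle helps defeat hardware prefetching. */

#define NUM_ACCESSES (1 << 24)

static double avg_latency_ns(size_t *next, size_t n) {
  /* Build a random single cycle over the n elements. */
  size_t *order = malloc(n * sizeof *order);
  for (size_t i = 0; i < n; i++) order[i] = i;
  for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
    size_t j = (size_t)rand() % (i + 1);
    size_t t = order[i]; order[i] = order[j]; order[j] = t;
  }
  for (size_t i = 0; i < n; i++)
    next[order[i]] = order[(i + 1) % n];
  free(order);

  struct timespec t0, t1;
  size_t p = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < NUM_ACCESSES; i++)
    p = next[p];                                    /* dependent load chain */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  if (p == (size_t)-1) printf("unreachable\n");     /* keep the loop live */
  return ns / NUM_ACCESSES;
}

int main(void) {
  /* Data set sizes from 16KB to 64MB, doubling each time. */
  for (size_t bytes = 1 << 14; bytes <= 1 << 26; bytes <<= 1) {
    size_t *next = malloc(bytes);
    printf("%9zu B : %6.2f ns/access\n", bytes, avg_latency_ns(next, bytes / sizeof(size_t)));
    free(next);
  }
  return 0;
}

Plotting the reported ns/access against the data-set size should reproduce the step pattern shown on the next slide.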


Example Results

(Figure: measured average access latency, in CPU cycles, vs. data set size in bytes, from 16KB to 2MB. The latency is flat within an L1 region at the smallest sizes, steps up to an L2 region, and steps up again to an L3 region as the data set outgrows each cache level.)

* Measured on an Intel Core i7 CPU @ 2.4 GHz


* Program compiled with gcc -O3
ECE 565 – Fall 2022 11
Cache Access Patterns
(Diagram: example access pattern with stride = 4 words over a data set of 24 words.)

• Vary the data set accessed by our code


– As data set grows larger than a cache level, performance drops
• Performance can be measured as latency or bandwidth

• Vary the stride of data accessed by our code


– Affects spatial locality provided by cache blocks
– If stride is less than the size of a cache line –
• The first access to a block may miss, but the following accesses to that block hit
– If stride is greater than or equal to the size of a cache line –
• Every access touches a new block, so sequential accesses are slow

ECE 565 – Fall 2022 12
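A stride-varying sketch, under assumed buffer and stride sizes (again, not the course's posted code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sweep a fixed buffer that is larger than the last-level cache with
 * different strides and report the average time per access. Strides smaller
 * than the 64B cache block reuse each fetched block; strides of a block or
 * more touch a new block on every access. Hardware prefetching of these
 * regular strides will blunt the effect, as noted earlier. */
#define BUF_BYTES (64ul * 1024 * 1024)

int main(void) {
  volatile char *buf = malloc(BUF_BYTES);
  for (size_t i = 0; i < BUF_BYTES; i++) buf[i] = (char)i;   /* warm / page in */

  for (size_t stride = 8; stride <= 1024; stride *= 2) {
    struct timespec t0, t1;
    size_t accesses = 0;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_BYTES; i += stride) {
      sum += buf[i];
      accesses++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("stride %4zu B: %.2f ns/access (sum=%ld)\n", stride, ns / accesses, sum);
  }
  return 0;
}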


Memory Access Latency vs. Bandwidth

• Thus far, we’ve focused mostly on latency
– Hit time for a cache level or memory
– E.g. we put together example code using pointer chasing
• Stresses the latency of each access
• Only one memory access in flight at a given time

• Cache and memory bandwidth is also important
– Bandwidth is a rate
• Bytes per cycle
• GB per second
– Bandwidth gets smaller at lower levels of the memory hierarchy
• Just as latency grows larger
– Some code is throughput sensitive, not latency sensitive
• Code performance would improve with higher access bandwidth
– Even if access latency increased

(Diagram: P → L1 $ → L2 $ → L3 $ → Main Memory)
ECE 565 – Fall 2022 13
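For contrast with the pointer-chasing code, here is a rough read-bandwidth sketch with independent loads (the buffer size and reporting format are assumptions):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Stream through a large buffer with independent loads so that many cache
 * misses can be outstanding at once, then report GB/s rather than ns per
 * access. */
#define BUF_BYTES (256ul * 1024 * 1024)
#define N_ELEMS   (BUF_BYTES / sizeof(uint64_t))

int main(void) {
  uint64_t *buf = malloc(BUF_BYTES);
  for (size_t i = 0; i < N_ELEMS; i++) buf[i] = i;   /* initialize / page in */

  struct timespec t0, t1;
  uint64_t sum = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < N_ELEMS; i++)
    sum += buf[i];                 /* independent loads: misses can overlap */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("read bandwidth: %.2f GB/s (sum=%llu)\n",
         BUF_BYTES / sec / 1e9, (unsigned long long)sum);
  free(buf);
  return 0;
}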


Matrix Multiplication

• Common operation in scientific applications


• Significant interaction with cache & memory subsystem

| 1  2  3  4 |   | 1  5  9 13 |   | 30  .  .  . |
| 5  6  7  8 | x | 2  6 10 14 | = |  .  .  .  . |
| 9 10 11 12 |   | 3  7 11 15 |   |  .  .  .  . |
|13 14 15 16 |   | 4  8 12 16 |   |  .  .  .  . |

30 = 1*1 + 2*2 + 3*3 + 4*4   (row 1 of A dotted with column 1 of B)

• Recall our memory layout discussion


– E.g. C/C++ uses row-major order
– 2D array is allocated as a linear array in memory

ECE 565 – Fall 2022 14


Matrix Multiplication Implementation

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}

(Diagram: A[i][k] sweeps along row i of A, B[k][j] walks down column j of B, and the result is written to C[i][j].)

• 3 Loops – i, j, k
– 6 ways to arrange the loops and multiply the matrices
• O(N³) total operations
– Each element of A and B is read N times
– N values are summed for each output element of C

ECE 565 – Fall 2022 15


Cache Analysis for Matrix Multiplication

• Each matrix element is 64 bits (a double)


• Assumptions:
– N is very large (cache cannot fit more than one row/column)
– Cache block size = 64 bytes (8 matrix elements per block)
• Consider access pattern for i,j,k loop structure
(Diagram: with the i-j-k ordering, A is swept along row i, B is walked down column j, and a single element of C is reused across the k loop.)

– A=good spatial locality; C=good temporal locality; B=poor locality

ECE 565 – Fall 2022 16


Matrix Multiplication

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}

Misses per inner-loop iteration: A = 0.125, B = 1, C = 0 (1.125 total)

• i-j-k
– Memory accesses for each inner loop iteration
• 2 loads: element A[i][k] and element B[k][j]
– A[i][k] access will be a cache miss once every 8 iterations (8B element, 64B block)
– B[k][j] access will be cache miss every iteration
• j-i-k cache miss behavior same as i-j-k
ECE 565 – Fall 2022 17
Matrix Multiplication

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;

for (k = 0; k < N; k++) {
  for (i = 0; i < N; i++) {
    tmp = A[i][k];
    for (j = 0; j < N; j++) {
      C[i][j] += tmp * B[k][j];
    }
  }
}

Misses per inner-loop iteration: A = 0, B = 0.125, C = 0.125 (0.25 total)

• k-i-j
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element B[k][j]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss once every 8 iterations
– B[k][j] access will be a cache miss once every 8 iterations
• i-k-j cache miss behavior same as k-i-j
ECE 565 – Fall 2022 18
Matrix Multiplication

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;

for (j = 0; j < N; j++) {
  for (k = 0; k < N; k++) {
    tmp = B[k][j];
    for (i = 0; i < N; i++) {
      C[i][j] += tmp * A[i][k];
    }
  }
}

Misses per inner-loop iteration: A = 1, B = 0, C = 1 (2 total)

• j-k-i
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element A[i][k]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss every iteration (column-wise access)
– A[i][k] access will be a cache miss every iteration (column-wise access)
• k-j-i cache miss behavior same as j-k-i
ECE 565 – Fall 2022 19
Matrix Multiplication Summary

• k is innermost loop (i-j-k, j-i-k)
– A = good spatial locality
– C = good temporal locality
– Misses per iteration: 1 + (element sz / block sz) = 1.125

• i is innermost loop (j-k-i, k-j-i)
– B = good temporal locality
– Misses per iteration: 2

• j is innermost loop (k-i-j, i-k-j)
– B, C = good spatial locality
– A = good temporal locality
– Misses per iteration: 2 * (element sz / block sz) = 0.25

ECE 565 – Fall 2022 20
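To see the effect end to end, here is a self-contained sketch comparing the i-j-k and k-i-j orderings from the previous slides; the matrix size and timing harness are assumptions:

#include <stdio.h>
#include <time.h>

/* Both variants compute C = A * B for square N x N matrices of doubles. */
#define N 1024

static double A[N][N], B[N][N], C[N][N];

static double seconds(void) {
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  return t.tv_sec + t.tv_nsec * 1e-9;
}

static void mm_ijk(void) {               /* ~1.125 misses per inner iteration */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      double sum = 0;
      for (int k = 0; k < N; k++)
        sum += A[i][k] * B[k][j];
      C[i][j] = sum;
    }
}

static void mm_kij(void) {               /* ~0.25 misses per inner iteration */
  for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++) {
      double tmp = A[i][k];
      for (int j = 0; j < N; j++)
        C[i][j] += tmp * B[k][j];
    }
}

int main(void) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[i][j] = 1.0; B[i][j] = 1.0; C[i][j] = 0.0;
    }

  double t0 = seconds(); mm_ijk(); double t1 = seconds();
  /* Reset C before the second variant, since k-i-j accumulates into it. */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) C[i][j] = 0.0;
  double t2 = seconds(); mm_kij(); double t3 = seconds();

  printf("i-j-k: %.2f s   k-i-j: %.2f s   (C[0][0]=%.0f)\n",
         t1 - t0, t3 - t2, C[0][0]);
  return 0;
}

Compiled with gcc -O3, the k-i-j version should run noticeably faster at large N, consistent with the miss counts above.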


Other Types of Caching

• Main memory is a cache for disk


– Operates at a physical page granularity
• TLB is a cache for page table
– Translation Lookaside Buffer
– Operates at a page table entry granularity

ECE 565 – Fall 2022 21


Virtual Address Space
(Diagram: virtual address space from 0x00000000 to 0xffffffff, containing, from bottom to top: Text, User Heap, Shared Libraries, User Stack, and Kernel Space.)

• Each process thinks it has access to the full address space provided by the machine
– 4GB on 32-bit computers
– ~16-256TB on 64-bit computers
– Illusion provided by the OS
• Every process has its own virtual address space
– Contents visible only to that process
• Address space for even one process is larger than physical memory
• How does it fit?
ECE 565 – Fall 2022 22
Physical Memory as a Cache for Disk
(Diagram: the virtual address spaces of Process 0 through Process N are mapped onto a single set of physical addresses.)

ECE 565 – Fall 2022 23


Physical Memory as a Cache for Disk

(Diagram: the virtual address spaces of Process 0 through Process N are mapped onto a single set of physical addresses.)

Two key performance aspects:
1) Which addresses from the various processes should we keep in physical memory?
2) How do we do the mapping between a virtual address and its address in physical memory?

ECE 565 – Fall 2022 24


What To Keep in Physical Memory?

• Want to service memory accesses from DRAM, not disk


– That is, if the access misses in all caches already
– Disk is thousands of times slower than DRAM
• ~60-80ns to several milliseconds

• Locality still rules the day


– Just as we discussed for caches
• Want to capture spatial and temporal locality in memory
– Temporal: retain recently accessed addresses in physical mem
– Spatial: fetch nearby addresses into physical mem on disk access

ECE 565 – Fall 2022 25


What To Keep in Physical Memory?
• OS manages the physical memory and disk accesses
– Physical memory management is software-based
– Operates on physical memory in units of pages
• Pages have some size of 2^P bytes (e.g. 4KB)
• Much larger than cache block size due to huge latency of disk
– OS treats physical memory as a fully associative cache
• A virtual page can be placed in any physical page
– Page replacement policies may be complex
• SW can maintain and track much more state than HW
• Since accessing a page from disk is so slow there is lots of time!
• Physical memory holds the large-scale working set of a program
– Working set size less than physical mem size
• Good performance for a process after memory is warmed up
– Working set sizes of all active processes greater than physical mem size
• Can cause a severe performance problem
• Called thrashing: pages are continuously swapped between memory and disk

ECE 565 – Fall 2022 26


Metrics

• Page hit
– Memory reference to an address that is stored in physical mem
• Page miss (page fault)
– Reference to an address that is not in physical memory
– Misses are expensive
• Access to disk
• Software is involved in managing the process

ECE 565 – Fall 2022 27


Physical Memory as a Cache for Disk
(Diagram: the virtual address spaces of Process 0 through Process N are mapped onto physical addresses — this looks complicated!)

ECE 565 – Fall 2022 28
What is Stored Where in Physical Memory?

• Need to remember current location for every virtual page


– Since physical mem is managed as fully associative cache
• Solution is called a page table
– Maps virtual pages to physical pages
– Per-process software data structure
(Diagram: a per-process page table with entries PTE 0 through PTE 5. Each entry holds a valid bit and either a physical page number or a disk address. Entries with the valid bit set point to virtual pages cached in physical memory; entries with the valid bit clear point to pages stored on disk or are null (unallocated).)
ECE 565 – Fall 2022 29
Page Table Management

• Page tables are very large


– One entry per page
– For a 32-bit address space:
• Assume 4 GB total virtual memory (2^32 bytes)
• Assume 4KB pages (2^12 bytes)
• 2^32 / 2^12 = 2^20 entries
• PTE is 4B in the x86 architecture, so 2^20 * 4B = 2^22 bytes (4MB)
• And that’s just for one process!

• Keep portions of the page table in memory; rest on disk


– The frequently and recently accessed portions, that is

ECE 565 – Fall 2022 30


But That’s Not Quite Enough

(Diagram: Processor Chip containing the CPU and MMU. The CPU issues a virtual address, the MMU translates it to a physical address, and that physical address is used by the L1 I$/D$, the L2 $, and physical memory.)

• Physical addresses are needed for cache lookups


– Beginning at the L1 cache
• PTE is needed to turn a VA into PA
– PTE is located in memory (at best)
– Memory access required for every load or store?

ECE 565 – Fall 2022 31


Translation Lookaside Buffer (TLB)

• TLB is a very fast cache of PTEs


– Located inside the MMU of a processor
– Go directly from virtual page to physical page address
• Hierarchy of TLBs in current processor designs
– Separate instruction & data L1 TLBs, L2 TLB

(Diagram: Processor Chip as before, now with a TLB inside the MMU. The CPU issues a virtual address, the MMU/TLB translates it to a physical address used by the L1 I$/D$, the L2 $, and physical memory.)

ECE 565 – Fall 2022 32


TLB Reach

• TLB Reach
– Amount of memory accessible from the TLB
– Should cover the working set size of a process
– (# TLB entries) * (Page size)
• For example
– 64 TLB entries in L1 DTLB * 64KB pages = 4MB reach

ECE 565 – Fall 2022 33


Increasing TLB Reach

• What if we need to cover larger working sets?


• We can’t increase the number of TLB entries
– Well, we could wait a few years for a newer processor
• We can increase the page size
– Most modern architectures support a set range of page sizes
– From 4KB to 16MB
• Example - hugepages
– Consult your favorite OS manual to turn on hugepages
– E.g. libhugetlbfs
– RHEL example:
    $ grep Hugepagesize /proc/meminfo
    Hugepagesize:       2048 kB
    $ grep HugePages_Total /proc/meminfo
    HugePages_Total:       0
    $ sysctl -w vm.nr_hugepages=128
    $ grep HugePages_Total /proc/meminfo
    HugePages_Total:     128
ECE 565 – Fall 2022 34
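One way a program can use the reserved pool directly is mmap with the Linux MAP_HUGETLB flag; this is a sketch assuming Linux and a 2MB huge page size (libhugetlbfs and transparent hugepages are alternative routes):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Request anonymous memory backed by the pre-reserved huge page pool (see the
 * sysctl above). The length must be a multiple of the huge page size, and the
 * call fails if no huge pages are reserved. */
#define HUGE_PAGE (2UL * 1024 * 1024)     /* assumed 2MB huge page size */
#define LEN       (4 * HUGE_PAGE)

int main(void) {
  void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    perror("mmap(MAP_HUGETLB)");          /* e.g. HugePages_Total is still 0 */
    return 1;
  }
  memset(p, 0, LEN);                      /* touch the memory */
  printf("mapped %lu bytes backed by huge pages\n", LEN);
  munmap(p, LEN);
  return 0;
}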
Memory Systems
(Figure: server diagram highlighting the memory DIMM slots.)

ECE 565 – Fall 2022 36


Typical Server Architecture

(Diagram: a CPU (multi-core chip + caches) with PCIe I/O links to external components (network, storage, etc.), I/O links to other CPUs in the same socket (often proprietary I/O), and channels down to main memory. All of it runs under a single OS image that sees all CPU cores and memories.)
ECE 565 – Fall 2022 37
Memory Bandwidth (DRAM)

• Currently, main memory is often DRAM
– DDR (Double Data Rate) standards defined by JEDEC
– DDR4 common in current server CPUs
– New servers offer DDR5
– Various speed grades available
• More on this in a minute
• Connected to the CPU with channels
– e.g. 4, 8, 16 (uses chip I/O pins)

(Diagram: CPU (multi-core chip + caches) → Mem Controller → Main Memory)
• Memory controller schedules requests from the CPU
– Reads from cache misses
– Writes from cache evictions of dirty cache blocks

ECE 565 – Fall 2022 38


DRAM Performance
• Latency: we’ve discussed before
– 60-70ns load-to-use latency in current server CPUs is common
– This would be an uncontended load (no other traffic on the chip)
• Bandwidth (peak): defined by several factors
– Number of off-chip DDR channels (e.g. 4, 8, …)
– Bit width of each channel (e.g. 8 bytes)
– Channel frequency (e.g. 3200 MHz)
• Bandwidth (effective)
– Defined by ability of the memory controller to schedule requests
to utilize the DRAM devices and DDR channels efficiently
– 80% of peak B/W is a typical reasonable utilization
– Could be a bit better for some controllers or programs
• E.g. mostly read traffic compared to an even mix of read-write
ECE 565 – Fall 2022 39
Server DRAM B/W Example

ECE 565 – Fall 2022 40


Server DRAM B/W Example (2)

• 6 DDR4 memory channels per chip; 8B channel width


• 2666 = 2.666 Giga-transfers per second (GT/s)
• Peak DDR4 B/W per-chip:
– 6 channels * 8 B/channel * 2.666 GT/s = 127.97 GB/s

• DDR4 supports up to 3200 MT/s


• What would be B/W with 8 channels and 3200 DDR4?
– 8 * 8 * 3.2 = 204.8 GB/s

• Remember this is peak; Program would see ~80% of this


ECE 565 – Fall 2022 41
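The same peak-bandwidth arithmetic as a tiny sketch; the 80% effective-utilization factor is the rule of thumb from the previous slide:

#include <stdio.h>

/* Peak DRAM bandwidth = channels * bytes per channel * transfer rate.
 * Inputs are the example numbers from this slide. */
static double peak_gbs(int channels, int bytes_per_channel, double gt_per_s) {
  return channels * bytes_per_channel * gt_per_s;
}

int main(void) {
  printf("6 x 8B x 2.666 GT/s: peak %.2f GB/s, ~%.1f GB/s effective\n",
         peak_gbs(6, 8, 2.666), 0.80 * peak_gbs(6, 8, 2.666));
  printf("8 x 8B x 3.200 GT/s: peak %.2f GB/s, ~%.1f GB/s effective\n",
         peak_gbs(8, 8, 3.200), 0.80 * peak_gbs(8, 8, 3.200));
  return 0;
}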
More on Memory B/W
• Speaking of what a program would see…
• Remember B/W is a rate (# of bytes or accesses per time)
• Programs that generate many outstanding memory
requests can stress bandwidth more than latency
• Would we see peak Mem B/W w/ a program on 1 core?
– No
– Memory system is provisioned to provide bandwidth to all cores
• What defines B/W we could get from 1 core?
– Max number of in-flight memory requests a core can sustain
– Typically will be defined by size of Cache Miss handling structures
• e.g. size of Read Queue or MSHRs (Miss status handling registers)

ECE 565 – Fall 2022 42


Bandwidth from 1 Core
• Let’s look at an example:
– Suppose cache block size is 64 bytes, chip frequency is 2.4 GHz
– Suppose per-core L2 can support 32 pending cache misses
– Suppose average latency is 200 cycles
• Remember there is contention now so each access may take longer
– B/W = ((32 misses * 64 B/miss) / 200 cycles) * 2.4 GHz
• 24.576 GB/s would be max B/W for a single core

ECE 565 – Fall 2022 43
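The same single-core bound written out; all inputs are the assumed example values from this slide:

#include <stdio.h>

/* Max single-core bandwidth = (in-flight misses * bytes per miss) / miss
 * latency, converted from bytes per cycle to GB/s. */
int main(void) {
  double pending_misses = 32;     /* pending L2 misses the core can sustain */
  double block_bytes    = 64;     /* bytes brought in per miss              */
  double latency_cycles = 200;    /* average (contended) miss latency       */
  double freq_ghz       = 2.4;    /* chip frequency                         */

  double bytes_per_cycle = pending_misses * block_bytes / latency_cycles;
  printf("max single-core B/W ~= %.3f GB/s\n", bytes_per_cycle * freq_ghz);
  return 0;
}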


Other Memory Technologies
• Other memory technologies are emerging
• For use along with DRAM or as a replacement
• E.g. GPUs today use High Bandwidth Memory (HBM)
– Also defined by standards, similarly to DRAM
– Provides higher bandwidth (via wide interfaces and die stacking)
– But maximum capacity is much lower
• HBM2 / HBM2E
– Rates quoted in terms of giga-transfers per second (GT/s)
– HBM2: 128B interface (wide!), up to 2 GT/s = 256 GB/s per “stack”
– e.g. NVIDIA Volta GPU:
• 4 HBM2 stacks per GPU (16 GB total) for 900 GB/s per GPU

• HBM3 is current standard


ECE 565 – Fall 2022 44
