Lecture 5: Memory Hierarchy and Cache

Traditional Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the cache? (Block placement)
Q2: How is a block found if it is in the cache? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
    Random, LRU
Q4: What happens on a write? (Write strategy)
Cache-Related Terms
ICACHE: instruction cache
DCACHE (L1): data cache closest to registers
SCACHE (L2): secondary data cache
TCACHE (L3): third-level data cache
Data from SCACHE has to go through DCACHE to reach the registers. TCACHE is larger than SCACHE, and SCACHE is larger than DCACHE. Not all processors have a TCACHE.
Example:
Split 16 KB instruction + 16 KB data caches: instruction miss rate = 0.64%, data miss rate = 6.47%
32 KB unified cache: aggregate miss rate = 1.99%
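As a rough worked comparison (the 75% instruction-reference mix is an assumption for illustration, not a figure from the lecture): the split caches miss on about 0.75*0.64% + 0.25*6.47%, roughly 2.1% of references, so the unified cache's 1.99% is slightly lower, although the unified design adds contention between instruction and data accesses.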
Which one should we place in the cache? How can we tell which one is in the cache?
Cache Basics
Cache hit: a memory access that is found in the cache (cheap).
Cache miss: a memory access that is not in the cache (expensive, because we need to get the data from elsewhere).
Consider a tiny cache (for illustration only):
[Figure: a tiny example cache holding addresses of the form X000, X001, X011, X101, X111]
Set associative: there are N cache banks (ways); a memory block maps to one set but may be placed in any of the N banks, so on a miss we must choose which line in the set to replace. There are three common algorithmic choices for which line to replace:
Random: choose any line using a (pseudo-)random number generator. This is cheap and simple to build.
LRU (least recently used): preserves temporal locality, but is expensive to implement. It is not much better than random according to (biased) studies.
FIFO (first in, first out): random is far superior.
Cache line length: the number of bytes loaded together in one entry.
Direct mapped: only one line for a given address range can be in the cache at a time.
Associative: two or more lines with different addresses in the same range can be present.
Direct-Mapped Cache
Direct-mapped cache: a block from main memory can go in exactly one place in the cache. It is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.
[Figure: an 8-line direct-mapped cache (lines 0-7); each of main-memory blocks 0-31 maps to exactly one cache line]
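A minimal sketch of the mapping in code; the 32 KB size and 64-byte lines are assumptions chosen for illustration, not figures from the lecture:

   program map_address
      integer, parameter :: line_bytes = 64, nlines = 512   ! hypothetical 32 KB direct-mapped cache
      integer :: addr, offset, index, tag
      addr   = 123456                           ! example byte address
      offset = mod(addr, line_bytes)            ! byte within the cache line
      index  = mod(addr/line_bytes, nlines)     ! which cache line the block must go in (placement)
      tag    = addr/(line_bytes*nlines)         ! tag compared on lookup (identification)
      print *, 'offset =', offset, ' index =', index, ' tag =', tag
   end program map_address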
Diagrams

[Figure, serial: CPU (registers and logic) -> Cache -> Main Memory]
[Figure, parallel: CPU 1 through CPU p, each with its own cache (Cache 1 through Cache p)]
Registers
Registers are the source and destination of most CPU data operations. They hold one element each. They are made of static RAM (SRAM), which is very expensive. The access time is usually 1-1.5 CPU clock cycles. Registers are at the top of the memory subsystem.
Principles of Locality
Temporal: an item referenced now will be referenced again soon. Spatial: an item referenced now causes its neighbors to be referenced soon. Lines, not words, are moved between memory levels; this serves both principles. There is an optimal line size based on the properties of the data bus and the memory subsystem design. Cache lines are typically 32-128 bytes, with 1024 bytes being the longest currently.
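A small sketch of both principles in a loop; the array length and the 16-reals-per-line figure (a 64-byte line) are assumptions for illustration:

   program locality
      integer, parameter :: n = 1000000
      real :: a(n), s
      integer :: i
      a = 1.0
      s = 0.0
      do i = 1, n            ! unit stride: consecutive a(i) share a cache line (spatial locality)
         s = s + a(i)        ! s stays in a register throughout (temporal locality)
      end do
      do i = 1, n, 16        ! stride of one assumed line: nearly every access touches a new line
         s = s + a(i)
      end do
      print *, s
   end program locality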
Cache Thrashing
Thrashing occurs when frequently used cache lines replace each other. There are three primary causes for thrashing:
Instructions and data can conflict, particularly in unified caches.
Too many variables, or arrays too large to fit into cache, are accessed.
Indirect addressing, e.g., sparse matrices.
Machine architects can add sets to the associativity. Users can buy another vendor's machine. However, neither solution is realistic.
Indirect Addressing
d = 0
do i = 1, n
   j = ind(i)
   d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
end do

Change the loop statement to

   d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )
[Figure: the separate coordinate arrays x, y, z vs. a single interleaved array r]
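A fuller sketch of the rewritten loop; the subroutine wrapper, declarations, and packing copy are added here only to make the fragment self-contained (in practice r(3,n) would simply be the storage format used throughout):

   subroutine dist_sum(n, ind, x, y, z, d)
      integer n, ind(n), i, j
      real x(n), y(n), z(n), r(3,n), d
      ! pack coordinates so all three values for point j share one cache line
      do j = 1, n
         r(1,j) = x(j)
         r(2,j) = y(j)
         r(3,j) = z(j)
      end do
      d = 0
      do i = 1, n
         j = ind(i)
         d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )
      end do
   end subroutine dist_sum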
real a(m), b(m): for a 4 MB direct-mapped cache (with m a multiple of the cache size), a(i) and b(i) always map to the same cache line. This is trivially avoided using padding.
The padding array extra is at least 128 bytes long, which is longer than a cache line on all but one memory subsystem available today.
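A minimal padding sketch; the name extra comes from the slide, while the value of m and the common block used to force a contiguous layout are assumptions for illustration:

   integer, parameter :: m = 1024*1024      ! assumed: m reals fill the 4 MB cache exactly
   real a(m), extra(32), b(m)               ! 32 reals = 128 bytes of padding
   common /arrays/ a, extra, b              ! contiguous layout: a(i) and b(i) now land in different lines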
Cache Blocking
We want blocks to fit into cache. On parallel computers we have p times the cache, so the data may fit into cache on p processors even when it does not fit on one. This can lead to superlinear speedup! Consider matrix-matrix multiply.
do kk = 1, n, nblk
   do jj = 1, n, nblk
      do ii = 1, n, nblk
         do k = kk, kk+nblk-1
            do j = jj, jj+nblk-1
               do i = ii, ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
               end do
            end do
         end do
      end do
   end do
end do
[Figure: the matrices partitioned into NB x NB blocks (dimensions labelled M, N, K)]
Lessons
[Figure: qualitative trade-off curves (Factor A vs. Factor B, from less to more) for cache size, associativity, and block size]
The actual performance of a simple program can be a complicated function of the architecture. Slight changes in the architecture or program change the performance significantly. Since we want to write fast programs, we must take the architecture into account, even on uniprocessors. Since the actual performance is so complicated, we need simple models to help us design efficient algorithms. We will illustrate with a common technique for improving cache performance, called blocking.
Algorithm 2:
for j = 1:n
   for i = 1:n
      A(i,j) = B(i,j) + C(i,j)
What is the memory access pattern for Algorithms 1 and 2? Which is faster? What if A, B, and C are stored by row (as in C)?
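Only Algorithm 2 survives in these notes; assuming Algorithm 1 is the same update with the two loops interchanged, the access patterns being compared are (written here as Fortran loops over column-major arrays):

   ! Algorithm 1 (assumed): row-wise sweep, stride-n through A, B, C
   do i = 1, n
      do j = 1, n
         A(i,j) = B(i,j) + C(i,j)
      end do
   end do

   ! Algorithm 2: column-wise sweep, stride-1, about one miss per cache line
   do j = 1, n
      do i = 1, n
         A(i,j) = B(i,j) + C(i,j)
      end do
   end do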
2 misses per access to a & c vs. one miss per access; improve spatial locality
[Figure: matrix-vector multiply, y(i) = y(i) + A(i,:) * x(:)]
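For reference, a sketch of the matrix-vector multiply the later counts compare against (reconstructed from the figure labels y(i) = y(i) + A(i,:)*x(:)):

   do i = 1, n
      do j = 1, n
         y(i) = y(i) + A(i,j) * x(j)
      end do
   end do
   ! about 2*n**2 flops against roughly n**2 + 3*n slow-memory references, so q is approximately 2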
Multiply C=C+A*B
for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
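A sketch of the same triple loop annotated with the slow-memory traffic counted below (the comments describe the usual fast-memory staging assumed in that count):

   do i = 1, n
      ! read row i of A into fast memory          (n**2 reads of A in total)
      do j = 1, n
         ! read C(i,j) into fast memory           (n**2 reads of C in total)
         ! read column j of B into fast memory    (n**3 reads of B in total)
         do k = 1, n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
         end do
         ! write C(i,j) back to slow memory       (n**2 writes of C in total)
      end do
   end do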
Number of slow memory references for unblocked matrix multiply:
m = n^3        read each column of B n times
  + n^2        read each row of A once (one row per i)
  + 2*n^2      read and write each element of C once
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply.
Number of slow memory references for blocked matrix multiply (with N = n/b blocks per dimension):
m = N*n^2      read each block of B N^3 times (N^3 * n/N * n/N)
  + N*n^2      read each block of A N^3 times
  + 2*n^2      read and write each block of C once
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n.
So we can improve performance by increasing the block size b.
This can be much faster than matrix-vector multiply (q = 2).
Limit: all three b x b blocks from A, B, and C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3). Theorem (Hong & Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M)).
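For example, with a (purely illustrative) fast memory of M = 4096 words, b <= sqrt(4096/3), which is about 37, so blocks of roughly 36 x 36 are the largest that keep all three blocks resident at once.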
[Figure: Mflop/s (up to 700) vs. order of the vectors/matrices (10 to 400), comparing Level 3 BLAS (n-by-n matrix-matrix multiply), Level 2 BLAS (n-by-n matrix-vector multiply), and Level 1 BLAS (saxpy of n vectors)]
Optimizing in practice
Tiling for registers: loop unrolling, use of named "register" variables
Tiling for multiple levels of cache
Exploiting fine-grained parallelism within the processor: superscalar execution, pipelining
Complicated compiler interactions
Hard to do by hand (but you'll try)
Automatic optimization is an active research area:
PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac, www.cs.berkeley.edu/~iyer/asci_slides.ps
ATLAS: www.netlib.org/atlas/index.html
Let
p1 = (a11 + a22) * (b11 + b22)
p2 = (a21 + a22) * b11
p3 = a11 * (b12 - b22)
p4 = a22 * (b21 - b11)
p5 = (a11 + a12) * b22
p6 = (a21 - a11) * (b11 + b12)
p7 = (a12 - a22) * (b21 + b22)
Then
m11 = p1 + p4 - p5 + p7
m12 = p3 + p5
m21 = p2 + p4
m22 = p1 + p3 - p2 + p6
Strassen (continued)
T(n) = cost of multiplying n x n matrices = 7*T(n/2) + 18*(n/2)^2. Unrolling the recursion gives about 7^(log_2 n) = n^(log_2 7) multiplications, so T(n) = O(n^(log_2 7)) = O(n^2.81).
Summary
Performance programming on uniprocessors requires
understanding of the memory system: levels, costs, sizes
understanding of fine-grained parallelism in the processor, to produce a good instruction mix
Strassen in practice: available in several libraries; up to several times faster if n is large enough (hundreds); needs more memory than the standard algorithm; can be less accurate because of roundoff error. The current world-record exponent is O(n^2.376...).
Blocking (tiling) is a basic approach that can be applied to many matrix algorithms. It applies to uniprocessors and parallel processors.
The technique works for any architecture, but choosing the blocksize b and other details depends on the architecture
Similar techniques are possible on other data structures. You will get to try this in Assignment 2 (see the class homepage).
Today, VM (virtual memory) allows many processes to share a single memory without having to swap all processes to disk; today, VM protection is more important than its memory-hierarchy role. Today, CPU time is a function of (operations, cache misses) rather than just f(operations). What does this mean for compilers, data structures, and algorithms?
We can only do arithmetic on data at the top of the hierarchy. The higher-level BLAS let us keep it there, doing more flops per memory reference.
BLAS                       Memory refs   Flops    Flops / memory ref
Level 1:  y = y + a*x      3n            2n       2/3
Level 2:  y = y + A*x      n^2           2n^2     2
Level 3:  C = C + A*B      4n^2          2n^3     n/2
[Figure: a kernel achieving 32 flops per 20 loads]
Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

S = 1 / (fs + fp/P) = P / (fs*P + fp)

where fp is the fraction of the work that can be done in parallel, fs = 1 - fp is the serial fraction, and P is the number of processors.
[Figure: Amdahl's Law speedup vs. number of processors (up to about 250) for fp = 0.99]
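For example, with fp = 0.99 as in the figure, the speedup can never exceed 1/(1 - 0.99) = 100; with P = 250 processors it is 1/(0.01 + 0.99/250), which is about 72.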
The growing processor-memory performance gap will undermine our efforts at achieving maximum possible speedup!
Gustafson's Law

Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large. There is a way around this: increase the problem size as the number of processors grows.
Bigger problems mean bigger grids or more particles: bigger arrays.
The number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases.
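The corresponding scaled-speedup expression (stated here for reference, since the formula itself did not survive in these notes) is S = fs + fp*P. For example, with fp = 0.99 on P = 48 processors, S = 0.01 + 0.99*48, about 47.5, i.e. nearly linear in the number of processors.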
Absolute performance:
[Figures: speedup vs. number of processors (with the ideal line shown) and absolute performance in MFLOPS vs. number of processors and problem size, for two machines, one labelled O2K]
Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups but one of the machines is faster.
Time per Atom   Efficiency
1.62E-02        1.000
1.62E-02        1.000
1.63E-02        0.997
1.63E-02        0.995
1.63E-02        0.992
1.64E-02        0.987
1.66E-02        0.975
1.69E-02        0.958
1.75E-02        0.926
1.89E-02        0.857