Compiler Optimizations and Prefetching
L1 Size and Associativity
Pipelined Cache Access
Nonblocking Caches
Multibanked Caches
– Organize cache as independent banks to support simultaneous access (a mapping sketch follows this list)
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2
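A minimal sketch of how sequential interleaving could map blocks to banks (the mapping and both constants are assumptions for illustration; real L1/L2 designs vary):

#include <stdint.h>

#define BLOCK_BYTES 64u  /* assumed cache block size */
#define NUM_BANKS    4u  /* e.g., a 4-bank L2 as on the Cortex-A8 */

/* Sequential interleaving: consecutive block addresses map to
   consecutive banks, so streaming accesses cycle through all banks
   and independent banks can be accessed simultaneously. */
static inline uint32_t cache_bank(uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;  /* block address */
    return block % NUM_BANKS;             /* bank index */
}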
Critical Word First, Early Restart
[Figure: write buffer contents with and without write buffering]
Compiler Optimizations
– Loop Interchange
– Swap nested loops to access memory in sequential order (see the sketch after this list)
– Blocking
– Instead of accessing entire rows or columns, subdivide matrices into blocks
– Requires more memory accesses but improves locality of accesses
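A sketch of loop interchange on the classic 5000x100 example (the array shape follows the textbook example; the function wrappers are added here for completeness):

/* Before: x is stored row-major, but the inner loop walks down a
   column, striding 100 elements between consecutive accesses. */
void scale_column_order(double x[5000][100])
{
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row, so accesses
   are sequential and every word of a fetched block is used. */
void scale_row_order(double x[5000][100])
{
    for (int i = 0; i < 5000; i++)
        for (int j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];
}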
Reducing Cache Misses:
5. Compiler Optimizations
– Blocking: improve temporal and spatial locality
a) multiple arrays are accessed in both ways (i.e., row-major and column-major), namely, orthogonal accesses that cannot be helped by the earlier methods
b) concentrate on submatrices, or blocks (see the code sketch below)
c) all N×N elements of Y and Z are accessed N times and each element of X is accessed once; thus, there are N³ operations and 2N³ + N² reads! Capacity misses are a function of N and the cache size in this case.
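A sketch of the code this analysis refers to, following the classic textbook matrix-multiply example (the MIN macro and the C function wrappers are additions here; x is assumed to be zeroed before the blocked version runs):

#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Unblocked N x N multiply x = y * z: all of z is touched between
   reuses of a row of y, so for large N, z never stays in the cache. */
void matmul(size_t N, double x[N][N], double y[N][N], double z[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double r = 0.0;
            for (size_t k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version with blocking factor B: each B x B submatrix of z
   is fully reused from the cache before the loops move on, so misses
   depend on B rather than on N. */
void matmul_blocked(size_t N, size_t B,
                    double x[N][N], double y[N][N], double z[N][N])
{
    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < MIN(jj + B, N); j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < MIN(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;  /* assumes x starts zeroed */
                }
}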
Reducing Cache Misses:
5. Compiler Optimizations (cont’d)
Hardware Prefetching
– Fetch two blocks on a miss (including the next sequential block): overlapping memory access with execution by fetching data items before the processor requests them (a toy model is sketched below)
[Figure: Pentium 4 prefetching]
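A toy model of the fetch-two-blocks scheme, purely illustrative (the direct-mapped cache, the one-entry stream buffer, and all constants are assumptions here, not a description of any real prefetcher):

#include <stdbool.h>
#include <stdint.h>

#define SETS 256u                       /* toy direct-mapped cache */

static uint64_t tag[SETS];
static bool     valid[SETS];
static uint64_t stream_block;           /* block the stream buffer holds */
static bool     stream_valid = false;

/* Probe the toy cache for a block; install the block on a miss. */
static bool probe_and_fill(uint64_t block)
{
    uint32_t set = (uint32_t)(block % SETS);
    bool hit = valid[set] && tag[set] == block;
    valid[set] = true;
    tag[set]   = block;
    return hit;
}

/* On a demand miss to block b, fetch b into the cache and start
   fetching b+1 into the stream buffer; a later access to b+1 then
   finds the block already on its way instead of paying a full miss. */
bool access_block(uint64_t block)
{
    if (probe_and_fill(block))
        return true;                    /* ordinary cache hit */
    bool covered = stream_valid && stream_block == block;
    stream_block = block + 1;           /* prefetch next sequential block */
    stream_valid = true;
    return covered;                     /* true: miss hidden by prefetch */
}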
Compiler Prefetching
– Insert prefetch instructions before the data is needed (a compiler-style sketch follows this list)
– Non-faulting: a prefetch doesn't cause exceptions
– Register prefetch: loads data into a register
– Cache prefetch: loads data into the cache
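A minimal sketch of a cache prefetch using GCC/Clang's __builtin_prefetch (a real builtin; the distance of 8 elements ahead and the loop itself are assumptions to be tuned per machine):

/* Prefetch a[i+8] for writing while working on a[i]; the builtin is
   non-faulting, so running past the end of the array is harmless. */
void scale_in_place(double *a, int n)
{
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8], /*rw=*/1, /*locality=*/1);
        a[i] = 2 * a[i];
    }
}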
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching
Assuming that each iteration of the pre-split loop (shown below) consumes 7 cycles and that there are no conflict or capacity misses, it consumes 7*300 = 2,100 iteration cycles plus 251*100 = 25,100 cache-miss cycles, for a total of 27,200 cycles;
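For reference, a reconstruction of the pre-split loop of this classic example (300 iterations in total; the array dimensions are inferred from the prefetched version that follows):

for (i = 0; i < 3; i++)
    for (j = 0; j < 100; j++)
        a[i][j] = b[j][0] * b[j+1][0];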
With prefetching instructions inserted:
for (j = 0; j < 100; j++) {
    prefetch(b[j+7][0]);     /* b[j][0] needed 7 iterations later */
    prefetch(a[0][j+7]);     /* a[0][j] needed 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i++)
    for (j = 0; j < 100; j++) {
        prefetch(a[i][j+7]); /* a[i][j] needed 7 iterations later */
        a[i][j] = b[j][0] * b[j+1][0];
    }
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching (cont’d)
An Example (continued)
the first loop consumes 9 cycles per iteration (due to the two prefetch instructions) and iterates 100 times, for a total of 900 cycles;
the second loop consumes 8 cycles per iteration (due to the single prefetch instruction) and iterates 200 times, for a total of 1,600 cycles;
during the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses, for a total of (4+7)*100 = 1,100 cache-miss cycles;
during the first 7 iterations of the second loop, array a incurs 4 cache misses each for i = 1 and i = 2, for a total of (4+4)*100 = 800 cache-miss cycles; array b does not incur any cache misses in the second split!
Total cycles consumed: 900 + 1,600 + 1,100 + 800 = 4,400
Prefetching improves performance: 27,200/4,400 ≈ 6.2-fold!
Summary
THANKS