
Compiler Optimizations and Prefetching

BY:

ADITYA P. P. PRASETYO, S. Kom., MT.


Ten Advanced Optimizations

– Small and simple first-level caches

– Critical timing path:
– addressing the tag memory, then
– comparing the tags, then
– selecting the correct set
– Direct-mapped caches can overlap tag compare and transmission of data
– Lower associativity reduces power because fewer cache lines are accessed
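A minimal sketch in C of the address split that makes this overlap possible; the block and index sizes below are illustrative assumptions, not values from the slide:

    #include <stdint.h>

    #define BLOCK_BITS 6   /* 64-byte blocks (assumed) */
    #define INDEX_BITS 9   /* 512 lines -> 32 KB direct-mapped cache (assumed) */

    /* In a direct-mapped cache the index selects exactly one line, so the
     * data read can begin while the tag comparison is still in flight. */
    static inline uint64_t cache_index(uint64_t addr) {
        return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    }
    static inline uint64_t cache_tag(uint64_t addr) {
        return addr >> (BLOCK_BITS + INDEX_BITS);
    }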

2
L1 Size and Associativity

(Figure: access time vs. size and associativity.)


3
L1 Size and Associativity

(Figure: energy per read vs. size and associativity.)


4
Way Prediction

– To improve hit time, predict the way to pre-set the mux

– Mis-prediction gives a longer hit time
– Prediction accuracy
– > 90% for two-way
– > 80% for four-way
– I-cache has better accuracy than D-cache
– First used on the MIPS R10000 in the mid-90s
– Used on the ARM Cortex-A8
– Extend to predict the block as well
– “Way selection”
– Increases the mis-prediction penalty
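A minimal C sketch of the idea, with all sizes and names assumed: each set remembers the way that hit last time, the predicted way is checked first, and a mis-predict pays extra latency while retraining the predictor:

    #define WAYS 2
    #define SETS 256

    static unsigned predicted_way[SETS];        /* per-set way predictor */
    static unsigned tags[SETS][WAYS];           /* stored tags           */

    /* Returns the hit latency in cycles (0 = miss); cycle counts assumed. */
    int wp_access(unsigned set, unsigned tag) {
        unsigned w = predicted_way[set];
        if (tags[set][w] == tag)
            return 1;                           /* predicted way hit: fast */
        for (unsigned i = 0; i < WAYS; i++)     /* slow path: try the rest */
            if (i != w && tags[set][i] == tag) {
                predicted_way[set] = i;         /* retrain the predictor   */
                return 2;                       /* hit with extra latency  */
            }
        return 0;                               /* miss */
    }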

5
Pipelining Cache

– Pipeline cache access to improve bandwidth
– Examples:
– Pentium: 1 cycle
– Pentium Pro – Pentium III: 2 cycles
– Pentium 4 – Core i7: 4 cycles

– Increases branch mis-prediction penalty

– Makes it easier to increase associativity

6
Nonblocking Caches

– Allow hits before previous misses complete
– “Hit under miss”
– “Hit under multiple miss”
– L2 must support this
– In general, processors can hide an L1 miss penalty but not an L2 miss penalty
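A minimal C sketch of hit-under-miss bookkeeping using miss status holding registers (MSHRs); the MSHR count and all names here are assumptions:

    #define MSHRS 4

    struct mshr { int valid; unsigned long block; };
    static struct mshr mshrs[MSHRS];

    /* Returns 1 if the access can proceed (a hit, or a miss that an MSHR
     * can track while later hits continue), 0 if all MSHRs are busy and
     * the cache must stall. */
    int nb_lookup(unsigned long block, int hit) {
        if (hit) return 1;                      /* hits bypass pending misses  */
        for (int i = 0; i < MSHRS; i++)
            if (mshrs[i].valid && mshrs[i].block == block)
                return 1;                       /* merge with outstanding miss */
        for (int i = 0; i < MSHRS; i++)
            if (!mshrs[i].valid) {
                mshrs[i].valid = 1;             /* allocate a new miss entry   */
                mshrs[i].block = block;
                return 1;
            }
        return 0;                               /* structural stall            */
    }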

7
Multibanked Caches
– Organize the cache as independent banks to support simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2

– Interleave banks according to block address
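Sequential interleaving can be sketched in one line of C; the block size and bank count are parameters here, not values from the slide:

    /* Consecutive block addresses map to consecutive banks. */
    unsigned bank_of(unsigned long addr, unsigned block_size, unsigned n_banks) {
        return (addr / block_size) % n_banks;
    }
    /* e.g. with 64-byte blocks and 4 banks: blocks 0,1,2,3,4 -> banks 0,1,2,3,0 */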

8
Critical Word First, Early Restart

– Critical word first

– Request the missed word from memory first
– Send it to the processor as soon as it arrives
– Early restart
– Request words in normal order
– Send the missed word to the processor as soon as it arrives

– Effectiveness of these strategies depends on the block size and the likelihood of another access to the portion of the block that has not yet been fetched
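A minimal C sketch of the critical-word-first return order (wrap-around fill); the 8-word block in the comment is an assumed example:

    /* Fill order for a block of words_per_block words when the word at
     * index `critical` missed: critical word first, then wrap around. */
    void cwf_order(unsigned critical, unsigned words_per_block, unsigned order[]) {
        for (unsigned i = 0; i < words_per_block; i++)
            order[i] = (critical + i) % words_per_block;
    }
    /* e.g. critical = 5, 8-word block -> fill order 5 6 7 0 1 2 3 4 */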
9
Merging Write Buffer

– When storing to a block that is already pending in the write buffer, update that write-buffer entry
– Reduces stalls due to a full write buffer
– Do not apply to I/O addresses
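A minimal C sketch of the merge rule; the entry count, words per entry, and all names are assumptions:

    #define ENTRIES 4
    #define WORDS   4   /* words per buffer entry (one block, assumed) */

    struct wb_entry { int valid; unsigned long block; unsigned long data[WORDS]; };
    static struct wb_entry wb[ENTRIES];

    /* Returns 1 if the store was buffered, 0 if the buffer is full (stall). */
    int wb_write(unsigned long block, unsigned word, unsigned long value) {
        for (int i = 0; i < ENTRIES; i++)
            if (wb[i].valid && wb[i].block == block) {
                wb[i].data[word] = value;   /* merge into the pending entry */
                return 1;
            }
        for (int i = 0; i < ENTRIES; i++)
            if (!wb[i].valid) {             /* allocate a fresh entry */
                wb[i].valid = 1;
                wb[i].block = block;
                wb[i].data[word] = value;
                return 1;
            }
        return 0;                           /* buffer full: stall */
    }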

(Figure: write-buffer entries without merging vs. with merging.)

10
Compiler Optimizations

– Loop Interchange
– Swap nested loops to access memory in sequential order (see the sketch below)

– Blocking
– Instead of accessing entire rows or columns, subdivide matrices into blocks
– Requires more memory accesses but improves locality of accesses
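A minimal sketch of loop interchange on a 5000×100 row-major array (the classic example; x and the bounds are illustrative):

    /* Before: x is row-major, so this strides through memory, touching a
     * different cache block on nearly every access. */
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

    /* After interchange: accesses are sequential, so every word in a
     * fetched block is used before the block is replaced. */
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];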

11
Reducing Cache Misses:
5. Compiler Optimizations
– Blocking: improve temporal and spatial locality
a) Multiple arrays are accessed in both ways (i.e., row-major and column-major); such orthogonal accesses cannot be helped by the earlier methods
b) Concentrate on submatrices, or blocks

c) All N*N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N³ operations and 2N³ + N² memory words accessed! Capacity misses are a function of N and the cache size in this case.
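The counts in (c) describe the standard matrix-multiply kernel X = Y*Z; a minimal version of that kernel for reference:

    /* Unblocked X = Y*Z: over the whole computation every element of Y and
     * Z is read up to N times, giving up to 2N^3 + N^2 memory words touched
     * for N^3 multiply-add operations. */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }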

14
Reducing Cache Misses:
5. Compiler Optimizations (cont’d)

– Blocking: improve temporal and spatial locality

a) To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor.
b) The total number of memory words accessed drops to 2N³/B + N²
c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
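A minimal sketch of the blocked kernel; min() stands for a small helper returning the lesser of its arguments, and x is assumed zeroed beforehand since partial sums accumulate across kk tiles:

    /* Blocked X = Y*Z with blocking factor B: each B*B tile of Z stays in
     * the cache while a strip of Y streams past it, cutting the words
     * accessed to roughly 2N^3/B + N^2. */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B, N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B, N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;   /* accumulate across kk tiles */
                }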

15
Hardware Prefetching
– Fetch two blocks on a miss (the requested block plus the next sequential block): overlap memory access with execution by fetching data items before the processor requests them. A minimal model follows.
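A minimal C model of next-block prefetching into a one-entry stream buffer; all names and the single-entry buffer are assumptions:

    /* On a miss to block b, the demand block is fetched and block b+1 is
     * staged in the stream buffer, so a sequential miss hits there instead
     * of paying the full memory latency. */
    static unsigned long stream_buf;
    static int stream_buf_valid = 0;

    void on_miss(unsigned long block) {
        /* ... fetch the demand block itself ... */
        stream_buf = block + 1;        /* prefetch the next sequential block */
        stream_buf_valid = 1;
    }

    int stream_buf_hit(unsigned long block) {
        return stream_buf_valid && block == stream_buf;
    }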

(Figure: Pentium 4 pre-fetching.)
16
Compiler Prefetching
– Insert prefetch instructions before the data is needed
– Non-faulting: a prefetch doesn’t cause exceptions

– Register prefetch
– Loads data into a register
– Cache prefetch
– Loads data into the cache

– Combine with loop unrolling and software pipelining
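As a concrete example of a non-faulting cache prefetch, GCC and Clang provide the __builtin_prefetch builtin; the prefetch distance of 8 iterations below is an assumption that would be tuned to the miss latency:

    /* Prefetch 8 iterations ahead (distance assumed); the arguments are
     * (address, rw: 0 = read, temporal-locality hint 0-3). */
    void scale(double *a, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8], 0, 1);
            a[i] = 2.0 * a[i];
        }
    }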
17
Reducing Cache Miss Penalty:
Compiler-Controlled Prefetching

 Compiler inserts prefetch instructions

 An Example
for(i:=0; i<3; i:=i+1)
for(j:=0; j<100; j:=j+1)
a[i][j] := b[j][0] * b[j+1][0]
 16-byte blocks, 8KB cache, direct-mapped, write back, 8-byte elements;
what kind of locality, if any, exists for a and b?
a. For a, 3 rows of 100 elements are visited; spatial locality: with two
8-byte elements per 16-byte block, even-indexed elements miss and
odd-indexed elements hit, leading to 3*100/2 = 150 misses
b. For b, column 0 of 101 rows is visited; no spatial locality, but there is
temporal locality: the element b[j+1][0] is reused as b[j][0] in the next
inner-loop iteration, and the same elements are reused in every outer-loop
iteration. That gives 100 misses for b[j+1][0] when i = 0 plus 1 miss for
b[0][0] at j = 0, for a total of 101 misses
 Assume the miss penalty is large (100 cycles), so data must be prefetched
at least 7 iterations in advance. Splitting the loop into two, we have

18
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching

 Assuming that each iteration of the original loop consumes 7 cycles and
there are no conflict or capacity misses, it consumes a total of 7*300
iteration cycles + 251*100 cache-miss cycles = 27,200 cycles;
 With prefetching instructions inserted:
for(j:=0; j<100; j:=j+1){
prefetch(b[j+7][0]);
prefetch(a[0][j+7]);
a[0][j] := b[j][0] * b[j+1][0];};
for(i:=1; i<3; i:=i+1)
for(j:=0; j<100; j:=j+1){
prefetch(a[i][j+7]);
a[i][j] := b[j][0] * b[j+1][0]}
19
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching (cont’d)

 An Example (continued)
 the first loop consumes 9 cycles per iteration (due to the two prefetch
instructions) and iterates 100 times for a total of 900 cycles,
 the second loop consumes 8 cycles per iteration (due to the single
prefetch instruction) and iterates 200 times for a total of 1,600 cycles,
 during the first 7 iterations of the first loop, array a incurs 4 cache
misses and array b incurs 7 cache misses, for a total of (4+7)*100 = 1,100
cache-miss cycles,
 during the first 7 iterations of the second loop for i = 1 and i = 2,
array a incurs 4 cache misses each, for a total of (4+4)*100 = 800 cache-miss
cycles; array b does not incur any cache miss in the second loop!
 Total cycles consumed: 900+1,600+1,100+800 = 4,400
 Prefetching improves performance by 27,200/4,400 ≈ 6.2 times!

20
Summary

21
THANKS

22
