
Lecture 5: Cache Optimization

Appendix B and Ch 2

DAP Spr.‘98 ©UCB 1


How to Improve Cache
Performance?

AMAT = Hit Time + Miss Rate × Miss Penalty


1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
4. Increase bandwidth
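
A minimal sketch of the AMAT formula above in C (the function name and the use of cycles as the unit are illustrative, not from the slides):

/* Average Memory Access Time: hit time plus miss rate times miss
   penalty, all measured in CPU cycles. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}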

DAP Spr.‘98 ©UCB 2


Introduction
Memory Hierarchy Basics
• Basic cache optimizations:
– Larger block size
» Reduces compulsory misses
» Increases capacity and conflict misses, increases miss
penalty
– Larger total cache capacity to reduce miss rate
» Increases hit time, increases power consumption
– Higher associativity
» Reduces conflict misses
» Increases hit time, increases power consumption

DAP Spr.‘98 ©UCB 3


Larger Block Size

[Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes of
1K, 4K, 16K, 64K, and 256K. Larger blocks reduce compulsory misses, but
past a point the miss rate rises again as conflict misses increase,
especially for the smaller caches.]

DAP Spr.‘98 ©UCB 4


Pseudo-Set Associative Cache

• A pseudo-associative cache sits between a direct-mapped and a set-
associative cache. In a set-associative cache, all entries in the set
are accessed in parallel, which slows down the access. In a pseudo-
associative cache, we view each "way" of the set as a separate
direct-mapped cache. The ways are accessed in sequence, not in
parallel. This saves time if the item is found in the first "way", but
wastes time if it is found in the last "way."
• On an access, you first try the first "way", then the second "way",
etc., until you reach the nth "way".
• On a hit in the kth way, the line is promoted to the first way, and
all lines in ways 1 to k-1 are demoted one way.
• On a miss, the incoming item is placed in the first way, the item in
the nth way is evicted, and all items in ways 1 to n-1 are demoted
one way.
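
A minimal sketch, in C, of the sequential way search with promotion described above (the data structures and names are hypothetical, not from the slides):

/* One set of an n-way pseudo-associative cache: ways are probed in
   order, and a hit (or a fill on a miss) moves the line to way 0. */
#define NWAYS 4

struct line { unsigned tag; int valid; };
struct set  { struct line way[NWAYS]; };

/* Returns the way index that hit, or -1 on a miss. */
int pseudo_assoc_lookup(struct set *s, unsigned tag)
{
    for (int k = 0; k < NWAYS; k++) {
        if (s->way[k].valid && s->way[k].tag == tag) {
            /* Promote the hit line to way 0; demote ways 0..k-1 by one. */
            struct line hit = s->way[k];
            for (int j = k; j > 0; j--)
                s->way[j] = s->way[j - 1];
            s->way[0] = hit;
            return k;                 /* later ways cost extra probe time */
        }
    }
    /* Miss: evict the line in the last way, demote the rest, fill way 0. */
    for (int j = NWAYS - 1; j > 0; j--)
        s->way[j] = s->way[j - 1];
    s->way[0].tag = tag;
    s->way[0].valid = 1;
    return -1;
}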

DAP Spr.‘98 ©UCB 5


DAP Spr.‘98 ©UCB 6
Fast Hit Time + Low Conflict => Victim Cache
• How to combine the fast hit time of direct mapped, yet still avoid
conflict misses?
• Add a small buffer to hold data discarded from the cache
• Check both the cache and the victim buffer simultaneously on a
data request from the CPU
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of
conflicts for a 4 KB direct-mapped data cache
• Used in Alpha, HP machines
• Opteron L3 cache is a victim cache

[Figure: a small fully associative victim buffer, each entry holding a
tag, a comparator, and one cache line of data, sitting between the
cache and the next lower level in the hierarchy]
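
A minimal sketch, in C, of a direct-mapped cache backed by a small victim buffer, following the check-both-on-access idea above (sizes, names, and the swap policy shown are illustrative):

/* Direct-mapped cache plus a small fully associative victim buffer.
   addr is a block address. */
#define SETS    64
#define VICTIMS  4

struct cline { unsigned tag;  int valid; };   /* main cache entry               */
struct vline { unsigned addr; int valid; };   /* victim entry: full block address */

struct cline cache[SETS];
struct vline victim[VICTIMS];

/* Returns 1 on a hit in the cache or the victim buffer, 0 on a true miss. */
int lookup(unsigned addr)
{
    unsigned idx = addr % SETS, tag = addr / SETS;

    if (cache[idx].valid && cache[idx].tag == tag)
        return 1;                                 /* hit in the main cache */

    for (int v = 0; v < VICTIMS; v++)
        if (victim[v].valid && victim[v].addr == addr) {
            /* Swap: the victim line moves back into the cache, and the
               displaced cache line takes its slot in the victim buffer. */
            if (cache[idx].valid)
                victim[v].addr = cache[idx].tag * SETS + idx;
            else
                victim[v].valid = 0;
            cache[idx].tag = tag;
            cache[idx].valid = 1;
            return 1;                             /* hit in the victim buffer */
        }
    return 0;   /* true miss: fetch from the next level; the line displaced
                   from the cache would then be placed in the victim buffer */
}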
DAP Spr.‘98 ©UCB 7
Reducing Misses by Hardware Prefetching
of Instructions & Data
• E.g., Instruction Prefetching
– Sequential prefetch or block prefetching
– Most processors fetch 2 blocks of instructions on a miss
– Cache Pollution if fetched block is unused!
– Extra block placed in "stream buffer"
– On miss check stream buffer
• Works with data blocks too:
– Jouppi [1990] 1 data stream buffer satisfied 25% misses from 4KB
cache; 4 streams got 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams
satisfied 50% to 70% of misses from two 64 KB, 4-way set-associative
caches
– Data Prediction is difficult, but works well with scientific applications
• Prefetching relies on having extra memory bandwidth that
can be used without penalty
• Question: What to prefetch and when to prefetch?
Instruction prefetch is fine, but data?
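
A minimal sketch, in C, of the sequential stream-buffer idea: on a cache miss, the head of the stream buffer is checked before going to memory, and a hit there triggers a prefetch of the next sequential block (depth, names, and the restart policy are illustrative):

#define SB_DEPTH 4

/* FIFO of prefetched block addresses (a real buffer also holds the data). */
unsigned stream_buf[SB_DEPTH];
int      sb_valid[SB_DEPTH];

void prefetch_block(unsigned blk) { (void)blk; /* model: issue a memory read */ }

/* Called on a cache miss for block address blk.
   Returns 1 if the stream buffer supplied the block. */
int stream_buffer_check(unsigned blk)
{
    if (sb_valid[0] && stream_buf[0] == blk) {
        /* Consume the head, shift the FIFO up, prefetch the next block. */
        for (int i = 0; i < SB_DEPTH - 1; i++) {
            stream_buf[i] = stream_buf[i + 1];
            sb_valid[i]   = sb_valid[i + 1];
        }
        stream_buf[SB_DEPTH - 1] = blk + SB_DEPTH;
        sb_valid[SB_DEPTH - 1]   = 1;
        prefetch_block(blk + SB_DEPTH);
        return 1;
    }
    /* Missed the stream buffer too: flush it and restart at the next block. */
    for (int i = 0; i < SB_DEPTH; i++) {
        stream_buf[i] = blk + 1 + i;
        sb_valid[i]   = 1;
        prefetch_block(blk + 1 + i);
    }
    return 0;
}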
DAP Spr.‘98 ©UCB 8
Advanced Optimizations
Hardware Prefetching
• Fetch two blocks on miss (next sequential block)

Pentium 4 Pre-fetching
Intel Core i7 supports hardware prefetching to both L1 and L2 caches

DAP Spr.‘98 ©UCB 9


Leave it to the Programmer?
Software Prefetching Data
• Data Prefetch – Explicit prefetch instructions
– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS, PowerPC, SPARC)
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Very suitable for prefetching from main memory
• Issuing prefetch instructions takes time
– Is the cost of issuing prefetches < the savings from reduced misses?
– Wider superscalar issue makes it easier to find the issue bandwidth
for the extra prefetch instructions
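
A minimal sketch of non-binding software prefetching in C, using GCC/Clang's __builtin_prefetch (the loop, array, and prefetch distance are illustrative):

#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead so the data
   is already in the cache when the loop reaches it. */
double sum(const double *a, size_t n)
{
    const size_t dist = 16;                 /* prefetch distance in elements */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1);  /* read, low temporal locality */
        s += a[i];
    }
    return s;
}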

DAP Spr.‘98 ©UCB 10


Compiler Optimization to Reduce
Miss Rate
• Nested loops may access data in memory non-sequentially, causing
cache misses.
• Exchanging the nesting of the loops can make the code access the
data in order, reducing cache misses.
• Example: if x is a two-dimensional array of size [5000][100],
allocated row major (i.e. x[i][j] is followed in memory by x[i][j+1]),
then modify the program as below.

/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];
DAP Spr.‘98 ©UCB 11
EX. Block Matrix Algorithm
• Operate on submatrices (blocks) instead of entire rows or columns.
• The submatrices can fit into cache.

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After */
for (jj = 0; jj < N; jj = jj + B)              /* among blocks   */
    for (kk = 0; kk < N; kk = kk + B)          /* among blocks   */
        for (i = 0; i < N; i++)
            for (j = jj; j < jj + B; j++) {    /* within a block */
                r = 0;
                for (k = kk; k < kk + B; k++)  /* within a block */
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

DAP Spr.‘98 ©UCB 12


Figure 2.8 A snapshot of the three arrays x, y, and z when N = 6 and i = 1. The age of accesses to the array elements is
indicated by shade: white means not yet touched, light means older accesses, and dark means newer accesses.
Compared to Figure 2.9, elements of y and z are read repeatedly to calculate new elements of x. The variables i, j, and k
are shown along the rows or columns used to access the arrays.

DAP Spr.‘98 ©UCB 13


Figure 2.9 The age of accesses to the arrays x, y, and z when B = 3. Note that, in contrast to Figure 2.8, a smaller number
of elements is accessed.

DAP Spr.‘98 ©UCB 14


Summary: Miss Rate Reduction
CPU time = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
• 3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Can be incorrect. Frees HW/SW to guess!

DAP Spr.‘98 ©UCB 15


Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
4. Increase bandwidth

DAP Spr.‘98 ©UCB 16


1. Reduce Miss Penalty:
Early Restart and Critical Word First
• Don’t wait for full block to be loaded before restarting
CPU
– Early restart—As soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue execution
– Critical Word First—Request the missed word first from
memory and send it to the CPU as soon as it arrives; let the
CPU continue execution while filling the rest of the words in
the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality => the CPU tends to want the next sequential word
soon, so it is not clear how much early restart helps
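
A minimal sketch, in C, of the wrapped-fetch word ordering used by critical word first: the requested word returns first, then the rest of the block in wrap-around order (block size and names are illustrative):

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Print the order in which the words of a block are returned when the
   miss was to word 'critical' (critical word first / wrapped fetch). */
void fill_order(unsigned critical)
{
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%u ", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");
}

/* fill_order(5) prints: 5 6 7 0 1 2 3 4 */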

DAP Spr.‘98 ©UCB 17


2. Reducing Miss Penalty:
Read Priority over Write on Miss
• Give priority to reads over writes on a miss by putting
the writes in a write buffer
• Write-through with write buffers => RAW conflicts with
main memory reads on cache misses
– If we simply wait for the write buffer to empty, we might increase
the read miss penalty (by 50% on the old MIPS 1000)
– Check write buffer contents before read;
if no conflicts, let the memory access continue
• Write-back: want the buffer to hold displaced blocks
– Consider when a read miss is replacing a dirty block
– Normally: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, then do the read,
and then do the write
– The CPU stalls less since it restarts as soon as the read is done
DAP Spr.‘98 ©UCB 18
Write Buffers
• A write buffer holds words to be written to the L2 cache/memory,
along with their addresses
– 2 to 4 entries deep
– all read misses are checked against pending writes for
dependencies (associatively)
– allows reads to proceed ahead of writes
– can coalesce writes to the same block address to reduce time
(next slide)

[Figure: CPU <-> L1; writes from L1 pass through the write buffer on
their way to L2]
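
A minimal sketch, in C, of checking a small write buffer on a read miss before going to L2, as described above (entry count and field names are illustrative):

#define WB_ENTRIES 4

struct wb_entry {
    unsigned addr;       /* block address of the pending write */
    unsigned data;
    int      valid;
};

struct wb_entry wb[WB_ENTRIES];

/* On a read miss to block 'addr': if a pending write matches, forward
   its data; otherwise the read may bypass the queued writes to L2. */
int read_miss_check(unsigned addr, unsigned *data_out)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) {
            *data_out = wb[i].data;     /* RAW hazard resolved by forwarding */
            return 1;
        }
    return 0;                           /* no conflict: read proceeds to L2 */
}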

DAP Spr.‘98 ©UCB 19


Merging Write Buffers to
Reduce Miss Penalty
• Write buffer to allow processor to continue
while waiting to write to memory
• If buffer contains modified blocks, the
addresses can be checked to see if address
of new data matches the address of a valid
write buffer entry
• If so, new data are combined with that entry
• Increases the effective block size of writes for a write-through
cache when writes go to sequential words or bytes, since multiword
writes are more efficient to memory
• The Sun T1 (Niagara) processor, among
many others, uses write merging
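
A minimal sketch, in C, of merging a new word into an existing write-buffer entry for the same block address (the entry layout, counts, and names are hypothetical):

#define MWB_ENTRIES     4
#define WORDS_PER_ENTRY 4

struct merge_entry {
    unsigned block_addr;                     /* aligned block address */
    unsigned word[WORDS_PER_ENTRY];
    int      word_valid[WORDS_PER_ENTRY];
    int      valid;
};

struct merge_entry mwb[MWB_ENTRIES];

/* Try to merge a one-word write into an existing entry for the same
   block; return 1 on success, 0 if a new buffer entry is needed. */
int write_merge(unsigned block_addr, unsigned word_idx, unsigned data)
{
    for (int i = 0; i < MWB_ENTRIES; i++)
        if (mwb[i].valid && mwb[i].block_addr == block_addr) {
            mwb[i].word[word_idx]       = data;
            mwb[i].word_valid[word_idx] = 1;   /* merged: no new entry used */
            return 1;
        }
    return 0;
}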

DAP Spr.‘98 ©UCB 20


Write Merge in Write Buffers

DAP Spr.‘98 ©UCB 21


4: Add a second-level cache

• L2 Equations
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2
AMAT = Hit TimeL1 +
Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)
• Definitions:
– Local miss rate— misses in this cache divided by the total number
of memory accesses to this cache (Miss rateL2)
– Global miss rate—misses in this cache divided by the total
number of memory accesses generated by the CPU
Global Miss Rate is what matters
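As a worked illustration (hypothetical numbers): if L1 misses on 4% of
CPU accesses and L2's local miss rate is 50%, then L2's global miss
rate is 0.04 × 0.50 = 2% of all CPU accesses, i.e.
Global miss rateL2 = Miss rateL1 × Local miss rateL2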
DAP Spr.‘98 ©UCB 22
Comparing Local and Global Miss Rates
• 32 KB 1st level cache; increasing 2nd level cache size
• Local miss rate is for L2 – very high for small L2 sizes
• Single cache miss rate is the rate if we had one cache of the size
on the x-axis
• Global miss rate is close to the single-level cache rate provided
L2 >> L1
• The idea is to reduce the miss penalty without increasing the miss
rate
• L1 speed affects the CPU clock cycle, but L2 speed does not; L2
only affects the miss penalty of the first-level cache

[Figure: miss rate vs. cache size (log scale), comparing the L2 local
miss rate, the global miss rate, and the single-cache miss rate]
DAP Spr.‘98 ©UCB 23
AMAT Example
• For every 1000 memory references, assume 40
misses in L1 and 20 misses in L2;
Hit time in L1 is 1, L2 is 10; Miss penalty from L2 to
memory is 100 cycles; there are 1.5 memory
references per instruction. What is AMAT and
average stall cycles per instruction?
AMAT = Hit TimeL1 + Miss RateL1 × Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 × Miss PenaltyL2
– AMAT = 1 + 40/1000 × (10 + 20/40 × 100) = 3.4 cycles
– AMAT without L2 = 1 + 40/1000 × 100 = 5 cycles => an
improvement of 1.6 cycles due to L2
• Average memory stalls per instruction = Misses per instructionL1 × Hit
timeL2 + Misses per instructionL2 × Miss penaltyL2
– Average stall cycles per instruction = 1.5 × 40/1000 × 10 + 1.5 ×
20/1000 × 100 = 3.6 cycles
• Note: We have not distinguished reads and writes; L2 is accessed
only on an L1 miss, and there are no separate I- and D-caches.
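
A minimal sketch in C that reproduces the arithmetic above (the numbers are the ones given on this slide):

#include <stdio.h>

int main(void)
{
    double miss_rate_l1   = 40.0 / 1000.0;   /* misses per memory reference */
    double miss_rate_l2   = 20.0 / 40.0;     /* local L2 miss rate          */
    double hit_l1 = 1.0, hit_l2 = 10.0, mem_penalty = 100.0;
    double refs_per_instr = 1.5;

    double amat   = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * mem_penalty);
    double stalls = refs_per_instr * (40.0 / 1000.0) * hit_l2
                  + refs_per_instr * (20.0 / 1000.0) * mem_penalty;

    printf("AMAT = %.1f cycles\n", amat);                     /* 3.4 */
    printf("Stall cycles per instruction = %.1f\n", stalls);  /* 3.6 */
    return 0;
}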
DAP Spr.‘98 ©UCB 24
Reducing Miss Penalty Summary
CPU time = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
• Four techniques
1. Read priority over write on miss
2. Early Restart and Critical Word First on miss
3. Write Buffer
4. Second Level Cache
• Can be applied recursively to Multilevel Caches
– Danger is that the time to DRAM will grow with multiple
levels of cache memories
– First accesses (compulsory misses) in the L2 cache can make
things worse, since the worst-case miss penalty increases

DAP Spr.‘98 ©UCB 25
