Cache Optimizations
Optimization of Cache Performance
Technique                                          | Hit time | Miss penalty | Miss rate | H/W complexity | Why
Larger block size                                  |          | -            | +         | 0              | Reduces miss rate
Larger cache size                                  | -        |              | +         | 1              | Reduces miss rate
Higher associativity                               | -        |              | +         | 1              | Reduces miss rate
Multilevel caches                                  |          | +            |           | 2              | Reduces miss penalty
Read priority over writes                          |          | +            |           | 1              | Reduces miss penalty
Avoiding address translation during cache indexing | +        |              |           | 1              | Reduces hit time
(+ means the technique improves the factor; - means it hurts it.)
1. Small and Simple First-Level Caches to Reduce Hit Time and Power
● The critical timing path in a cache hit is the three-step process of:
1. addressing the tag memory (indexing),
2. comparing tags (tag comparison),
3. selecting the correct way (mux control selection).
● Direct-mapped caches can overlap the tag check with the transmission of the data, which reduces hit time.
● Lower levels of associativity reduce power because fewer cache lines are accessed.
● Energy consumption per read increases as cache size and associativity are increased.
● Three other factors have led to the use of higher associativity in first-level caches in recent designs:
1. Many processors take at least two clock cycles to access the cache, so the impact of a longer hit time may not be critical.
2. To keep the TLB out of the critical path, almost all L1 caches should be virtually indexed. This limits the size of the cache to the page size times the associativity.
3. With the introduction of multithreading, conflict misses can increase, making higher associativity more attractive.
2. Way Prediction to Reduce Hit Time
● Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
● Only a single tag comparison is performed in that clock cycle, in parallel with reading the cache data.
● A miss results in checking the other blocks for matches in the next clock cycle.
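A minimal sketch of this lookup flow, assuming a 4-way set-associative cache that keeps one predicted-way field per set (all names here, such as pred_way and lookup, are illustrative, not from a real design):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 4

typedef struct {
    uint32_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
    int      pred_way;   /* predicted way for the next access to this set */
} CacheSet;

/* Look up `tag` in one set. Sets *cycles to 1 on a correct prediction,
 * 2 when the other ways had to be checked. Returns the hit way or -1. */
int lookup(CacheSet *set, uint32_t tag, int *cycles)
{
    int p = set->pred_way;
    /* Cycle 1: compare only the predicted way, in parallel with
     * reading that way's data. */
    if (set->valid[p] && set->tag[p] == tag) {
        *cycles = 1;
        return p;
    }
    /* Cycle 2: prediction failed; check the remaining ways. */
    for (int w = 0; w < NUM_WAYS; w++) {
        if (w != p && set->valid[w] && set->tag[w] == tag) {
            set->pred_way = w;   /* retrain the predictor */
            *cycles = 2;
            return w;
        }
    }
    *cycles = 2;
    return -1;   /* miss in all ways */
}
```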
4. Nonblocking Caches to Increase Cache Bandwidth
● A nonblocking cache may further lower the effective miss penalty if it can overlap multiple misses: a “hit under multiple miss” or “miss under miss” optimization.
● The “hit under miss” optimization reduces the effective miss penalty by continuing to serve processor requests during a miss instead of ignoring them.
● “Miss under miss” is beneficial only if the memory system can service multiple misses.
● High-performance processors (e.g., the Intel Core i7) usually support both;
● lower-end processors (e.g., the ARM Cortex-A8) provide only limited nonblocking support, in L2.
● MSHRs (Miss Status Holding Registers) track the outstanding misses in a nonblocking cache.
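A minimal sketch of MSHR bookkeeping, assuming a small fixed-size MSHR file (the entry fields and the handle_miss function are illustrative names):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHR 8

typedef struct {
    bool     valid;
    uint64_t block_addr;   /* block being fetched from the next level */
    int      pending;      /* processor requests waiting on this block */
} MSHR;

static MSHR mshr[NUM_MSHR];

/* Called on a cache miss. Returns false only if every MSHR is busy,
 * in which case the cache must block (stall the processor). */
bool handle_miss(uint64_t block_addr)
{
    /* Secondary miss: a miss to a block that is already being fetched
     * just piggybacks on the existing entry ("miss under miss"). */
    for (int i = 0; i < NUM_MSHR; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            mshr[i].pending++;
            return true;
        }
    }
    /* Primary miss: allocate a free MSHR and issue the fetch. */
    for (int i = 0; i < NUM_MSHR; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ .valid = true,
                              .block_addr = block_addr,
                              .pending = 1 };
            /* issue_fetch(block_addr): request sent to the next level */
            return true;
        }
    }
    return false;   /* all MSHRs in use: structural stall */
}
```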
● With a nonblocking cache it is difficult to judge the impact of any single miss, and hence to calculate the average memory access time.
● The effective miss penalty is not the sum of the misses but the nonoverlapped time that the processor is stalled.
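For reference, the standard average memory access time formula is below; with a nonblocking cache the miss-penalty term must be measured as the nonoverlapped stall time per miss, not the raw memory latency.

```latex
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
```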
● In Li, Chen, Brockman, and Jouppi’s study, they found the following reductions in CPI:
● for the integer programs, about 7% for one hit under miss and about 12.7% for 64;
● for the floating-point programs, 12.7% for one hit under miss and 17.8% for 64.
5. Multibanked Caches to Increase Cache Bandwidth
● Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
● The ARM Cortex-A8 supports one to four banks in its L2 cache;
● the Intel Core i7 has 4 banks in L1, and its L2 has 8 banks.
● Multiple banks also are a way to reduce power consumption both in caches and DRAM.
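Banking works best when accesses spread themselves across the banks; a simple mapping is sequential interleaving, where consecutive block addresses go to consecutive banks. A sketch (the block size and bank count are illustrative):

```c
#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per cache block (illustrative) */
#define NUM_BANKS  4

/* Sequential interleaving: block address modulo the number of banks. */
static inline unsigned bank_of(uint64_t addr)
{
    uint64_t block_addr = addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}
/* Blocks 0,1,2,3 map to banks 0,1,2,3; block 4 wraps to bank 0, and so
 * on, so a sequential stream of blocks keeps all banks busy at once. */
```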
6. Critical Word First and Early Restart to Reduce Miss Penalty
● The processor normally needs just one word of the block at a time.
● Don’t wait for the entire block to be loaded before restarting the processor.
● Early restart:
● fetch the words in normal order, but
● as soon as the requested word of the block arrives, send it to the processor
● and let the processor continue execution.
● The L2 controller is not involved in this technique.
● Critical word first:
● request the missed word first from memory and
● send it to the processor as soon as it arrives;
● let the processor continue execution while filling the rest of the words in the block.
● The L2 cache controller forwards the words of a block out of order;
● the L1 cache controller must rearrange the words into block order.
● These techniques generally benefit designs with large cache blocks.
● The benefits of critical word first and early restart depend on the
size of the block and the likelihood of another access to the
portion of the block that has not yet been fetched.
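A sketch of the wrap-around fill order that critical word first implies, assuming an 8-word block (the fetch_word comment and the printouts stand in for the actual datapath):

```c
#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Fill a block starting at the word the processor actually missed on,
 * wrapping around afterward to pick up the earlier words. */
void fill_block_critical_first(int critical_word)
{
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        int w = (critical_word + i) % WORDS_PER_BLOCK;
        /* fetch_word(w): words arrive from memory in this order */
        if (i == 0) {
            /* First arrival is the requested word: forward it to the
             * processor immediately and let execution resume. */
            printf("word %d -> processor (restart)\n", w);
        } else {
            printf("word %d -> cache (background fill)\n", w);
        }
    }
}

int main(void)
{
    fill_block_critical_first(5);   /* e.g., miss on word 5 of the block */
    return 0;
}
```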
7. Merging Write Buffer to Reduce Miss Penalty
● Write-through caches use write buffers to send data to the lower level of the hierarchy.
● Write-back caches use a simple buffer when a block is replaced.
● If the write buffer is empty:
● the data and the full address are written in the buffer,
● the write is finished from the processor’s perspective,
● and the processor continues working while the write buffer writes to memory.
● Write merging: when performing a write to a block that is already pending in the write buffer, the new data are merged into the existing entry rather than taking up a new one.
● Reduces stalls due to a full write buffer.
● Multiword writes are usually faster than writes performed one word at a time.
● Figure 2.7 shows a write buffer without and with write merging.
● Assume the write buffer has four entries, and each entry can hold four 64-bit words.
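A minimal sketch of the merging logic for a buffer of exactly that shape: four entries, each covering one four-word block, with a per-word valid mask (all names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define ENTRIES         4
#define WORDS_PER_ENTRY 4   /* four 64-bit words per entry */

typedef struct {
    bool     valid;
    uint64_t block_addr;            /* address of the 32-byte block */
    uint64_t data[WORDS_PER_ENTRY];
    uint8_t  word_mask;             /* one bit per valid word */
} WriteBufferEntry;

static WriteBufferEntry wb[ENTRIES];

/* Buffer a one-word write. Returns false if the buffer is full
 * (the processor must stall until an entry drains to memory). */
bool buffer_write(uint64_t addr, uint64_t value)
{
    uint64_t block = addr / (WORDS_PER_ENTRY * 8);
    unsigned word  = (addr / 8) % WORDS_PER_ENTRY;

    /* Write merging: fold the write into an entry already holding
     * (part of) the same block instead of taking a fresh entry. */
    for (int i = 0; i < ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].data[word] = value;
            wb[i].word_mask |= (uint8_t)(1u << word);
            return true;
        }
    }
    /* No match: allocate a free entry for this block. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!wb[i].valid) {
            wb[i] = (WriteBufferEntry){ .valid = true, .block_addr = block };
            wb[i].data[word] = value;
            wb[i].word_mask  = (uint8_t)(1u << word);
            return true;
        }
    }
    return false;   /* buffer full: stall */
}
```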
8. Compiler Optimizations to Reduce Miss Rate
● Loop Interchange
● Blocking
Loop Interchange
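The classic loop interchange example (following the textbook’s x[i][j] illustration): exchanging the loop nest so a row-major C array is traversed in the order it is laid out in memory, turning strided misses into unit-stride hits.

```c
#define N 5000
#define M 100

double x[N][M];

/* Before: the inner loop walks down a column; consecutive accesses are
 * M*sizeof(double) = 800 bytes apart, so nearly every access can miss. */
void scale_before(void)
{
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row in storage order,
 * so every word of a fetched cache block is used before eviction. */
void scale_after(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            x[i][j] = 2 * x[i][j];
}
```

Blocking, the other optimization listed above, restructures a computation to work on submatrices that fit in the cache. A sketch of blocked matrix multiply, with an illustrative blocking factor B chosen so the tiles being reused stay resident:

```c
#define SIZE 512
#define B     32   /* blocking factor (illustrative) */

double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];   /* c starts zeroed */

/* Blocked matrix multiply c = a*b: each B x B tile of b is reused for a
 * whole strip of a while it is still resident in the cache. */
void matmul_blocked(void)
{
    for (int jj = 0; jj < SIZE; jj += B)
        for (int kk = 0; kk < SIZE; kk += B)
            for (int i = 0; i < SIZE; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0;
                    for (int k = kk; k < kk + B; k++)
                        r += a[i][k] * b[k][j];
                    c[i][j] += r;
                }
}
```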
9. Hardware Prefetching to Reduce Miss Penalty or Miss Rate
● Both instructions and data can be prefetched, either directly into the caches or into an external buffer that can be accessed more quickly than main memory.
● Typically, the processor fetches two blocks on a miss: the requested block and the next consecutive block.
● The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into the instruction stream buffer.
● If a requested block is present in the instruction stream buffer, the original cache request is canceled, the block is read from the stream buffer, and the next prefetch request is issued.
10. Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate
● If the miss penalty is small, the compiler just unrolls the loop once or twice and schedules the prefetches with the execution.
● If the miss penalty is large, it uses software pipelining or unrolls many times to prefetch data for a future iteration.
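A hand-written sketch of what such compiler-inserted prefetches look like, using GCC/Clang’s __builtin_prefetch; the unroll factor of two and the prefetch distance are illustrative choices:

```c
#define PREFETCH_AHEAD 16   /* elements ahead; tuned to cover the miss penalty */

/* Sum an array, unrolled twice, prefetching data for a future iteration.
 * __builtin_prefetch(addr, 0, 1) hints a read with low temporal locality. */
double sum_with_prefetch(const double *a, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        s += a[i];
        s += a[i + 1];
    }
    for (; i < n; i++)   /* remainder when n is odd */
        s += a[i];
    return s;
}
```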
Cache Optimization Summary