
Ten Advanced Optimizations of Cache Performance
Technique                                          | Hit time | Miss penalty | Miss rate | H/W complexity | Why
Larger block size                                  |          | -            | +         | 0              | Reduces miss rate
Larger cache size                                  | -        |              | +         | 1              | Reduces miss rate
Higher associativity                               | -        |              | +         | 1              | Reduces miss rate
Multilevel caches                                  |          | +            |           | 2              | Reduces miss penalty
Read priority over writes                          |          | +            |           | 1              | Reduces miss penalty
Avoiding address translation during cache indexing | +        |              |           | 1              | Reduces hit time

+ improves the factor, - hurts it, blank – no impact; hardware complexity is rated 0-3, from easiest to most challenging.


Ten Advanced Optimizations of Cache Performance
1. Reducing the hit time -
a. Small and simple first-level caches
b. Way-prediction.
** Both techniques also generally decrease power consumption.
2. Increasing cache bandwidth -
a. Pipelined caches,
b. multibanked caches, and
c. nonblocking caches.
**These techniques have varying impacts on power consumption.
3. Reducing the miss penalty -
a. Critical word first and
b. merging write buffers.
** These optimizations have little impact on power.
4. Reducing the miss rate -
a. Compiler optimizations.
** Obviously any improvement at compile time improves power consumption.
5. Reducing the miss penalty or miss rate via parallelism -
a. Hardware prefetching and
b. compiler prefetching.
**These optimizations generally increase power consumption
Hardware complexity generally increases as we go through these optimizations, and several of them also require sophisticated compiler technology.
1. Small and Simple First-Level Caches to
Reduce Hit Time and Power
● A fast clock cycle and power limitations favor a small first-level cache.
● Lower levels of associativity can reduce both hit time and power, but there are tradeoffs.

The critical timing path in a cache hit is the three-step process of
1 addressing the tag memory (indexing),
2 comparing the tags (tag comparison), and
3 selecting the correct way (mux control selection).

Direct-mapped caches can overlap the tag check with the transmission of the data, which reduces hit time.

● Hit time for direct mapped is slightly faster than two-way set associative; two-way set associative is 1.2 times faster than four-way; and four-way is 1.4 times faster than eight-way.
● These estimates depend on technology as well as the size of the cache.

Lower levels of associativity reduce power because fewer cache lines are accessed.
Energy consumption per read increases as cache size and associativity are increased.
● Three other factors have led to the use of higher associativity in first-level caches in recent designs:

1 Many processors take at least two clock cycles to access the cache, so the impact of a longer hit time may not be critical.

2 To keep the TLB out of the critical path, almost all L1 caches should be virtually indexed.
● This limits the size of the cache to the page size times the associativity.

3 With the introduction of multithreading, conflict misses can increase, making higher associativity more attractive.
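● Worked example (illustrative numbers, assuming 4 KiB pages): a virtually indexed L1 limited to page size × associativity can be at most 4 KiB × 8 = 32 KiB when 8-way set associative, but only 4 KiB × 2 = 8 KiB when 2-way.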
2. Way Prediction to Reduce Hit Time
● Predict the way in a set to reduce hit time
● Retains the conflict-miss reduction of set associativity while the hit speed approaches that of a direct-mapped cache.

● Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
● Only a single tag comparison is performed in that clock cycle, in parallel with reading the cache data.
● A miss results in checking the other blocks for matches in the next clock cycle.

● Block predictor bits are added to each block of a cache.
– The bits select which of the blocks to try on the next cache access.
● If the predictor is correct, the cache access latency is the fast
hit time.
● If not, it tries the other block, changes the way predictor, and has
a latency of one extra clock cycle.
● Set prediction accuracy is 90% for a two-way set associative cache and 80% for a four-way set associative cache.
● Prediction accuracy is better on I-caches than on D-caches.
● Way selection: use the way prediction bits to decide which cache block to actually access.
– Saves power when the way prediction is correct,
– but adds significant time on a way misprediction,
– so it is likely to make sense only in low-power processors.
– A significant drawback of way selection is that it makes it difficult to pipeline the cache access.
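A minimal software model of this lookup flow, written as a sketch in C; the structure and field names are hypothetical, since real way prediction is implemented in hardware:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 2                      /* two-way set associative (illustrative) */

/* Hypothetical model of one cache set plus its way-predictor bits. */
struct cache_set {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
    int      predicted_way;         /* extra predictor bits kept in the cache */
};

/* Returns the access latency in cycles: 1 for a correctly predicted hit,
   2 for a hit in another way (one extra cycle, predictor retrained),
   or miss_penalty for a genuine miss. */
int lookup(struct cache_set *s, uint64_t tag, int miss_penalty)
{
    int w = s->predicted_way;

    /* Cycle 1: compare only the predicted way, in parallel with the data read. */
    if (s->valid[w] && s->tag[w] == tag)
        return 1;                   /* fast hit */

    /* Cycle 2: check the remaining ways and update the way predictor. */
    for (int i = 0; i < WAYS; i++) {
        if (i != w && s->valid[i] && s->tag[i] == tag) {
            s->predicted_way = i;   /* retrain the predictor */
            return 2;               /* slow hit: one extra clock cycle */
        }
    }
    return miss_penalty;            /* miss in all ways */
}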
3. Pipelined Cache Access to
Increase Cache Bandwidth
● pipelined cache for faster clock cycle time
● Split cache memory access into several sub stages
● Indexing, Tag read, Hit/Miss check, Data Transfer

● pipeline cache access for high bandwidth


● Intel Pentium processors in the mid-1990s took 1 clock cycle,
● the Pentium Pro through Pentium III (mid-1990s through 2000) took 2 clocks, and
● the Pentium 4, which became available in 2000, and the current Intel Core i7 take 4 clocks.

● The result is a fast clock cycle and high bandwidth but slow hits,
– leading to a greater branch misprediction penalty.
● Makes it easier to implement high degrees of associativity.
4.Nonblocking Caches (lockup-free cache)
to Increase Cache Bandwidth
● In computers that allow out-of-order execution, the processor need not stall on a data cache miss.
● E.g., it can continue fetching instructions from the instruction cache while waiting for the data cache to return data.

● cache may further lower the effective miss penalty if it can overlap multiple
misses: a “hit under multiple miss” or “miss under miss” optimization.
● “hit under miss” optimization reduces the effective miss penalty
by being helpful during a miss instead of ignoring the requests of
the processor.
● “miss under miss” is beneficial only if the memory system can
service multiple misses;
● High-performance processors (e.g., the Intel Core i7) usually support both;
● lower-end processors (e.g., the ARM A8) provide only limited nonblocking support in L2.
● MSHRs (Miss Status Handling Registers) track the outstanding misses.
● It is difficult to judge the impact of any single miss and hence to calculate the average memory access time.
● The effective miss penalty is not the sum of the misses but the nonoverlapped time that the processor is stalled.

● The benefit of nonblocking caches is complex, as it depends upon:
– the miss penalty when there are multiple misses,
– the memory reference pattern, and
– how many instructions the processor can execute with a miss outstanding.
● Out-of-order processors are capable of hiding much of the miss penalty of an L1 data cache miss
that hits in the L2 cache but are not capable of hiding a significant fraction of a lower level cache
miss.

● Deciding how many outstanding misses to support depends on a variety of factors:


● The temporal and spatial locality in the miss stream, which determines whether a miss
can initiate a new access to a lower level cache or to memory
● The bandwidth of the responding memory or cache
● To allow more outstanding misses at the lowest level of the cache requires supporting at
least that many misses at a higher level, since the miss must initiate at the highest level
cache
● The latency of the memory system

● In their study, Li, Chen, Brockman, and Jouppi found that the reduction in CPI
● for the integer programs was about 7% for one hit under miss and about 12.7% for 64, and
● for the floating-point programs was 12.7% for one hit under miss and 17.8% for 64.
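A rough sketch of the bookkeeping behind a nonblocking cache, in C; the sizes and field names are assumptions for illustration, not the organization of any particular processor:

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS   8               /* max outstanding misses (illustrative) */
#define MAX_TARGETS 4               /* requests merged into one in-flight fill */

/* One miss-status handling register: an in-flight block fill plus the
   processor requests waiting on it. */
struct mshr {
    bool     valid;
    uint64_t block_addr;
    int      num_targets;
    struct { uint8_t offset; uint8_t dest_reg; } targets[MAX_TARGETS];
};

enum outcome { MERGED, ALLOCATED, STALL };

/* On a miss: merge with an entry already fetching the same block, allocate a
   new entry (miss under miss), or stall the cache if all MSHRs are busy. */
enum outcome handle_miss(struct mshr m[], uint64_t block_addr,
                         uint8_t offset, uint8_t dest_reg)
{
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (m[i].valid && m[i].block_addr == block_addr &&
            m[i].num_targets < MAX_TARGETS) {
            m[i].targets[m[i].num_targets].offset   = offset;
            m[i].targets[m[i].num_targets].dest_reg = dest_reg;
            m[i].num_targets++;
            return MERGED;          /* secondary miss to a block already in flight */
        }
    }
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!m[i].valid) {
            m[i].valid       = true;
            m[i].block_addr  = block_addr;
            m[i].num_targets = 1;
            m[i].targets[0].offset   = offset;
            m[i].targets[0].dest_reg = dest_reg;
            return ALLOCATED;       /* new outstanding miss */
        }
    }
    return STALL;                   /* all MSHRs busy: the cache must block */
}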
5.Multibanked Caches to Increase Cache
Bandwidth
● Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
● The Arm Cortex-A8 supports 1-4 banks in its L2 cache;
● the Intel Core i7 has 4 banks in L1 and the L2 has 8 banks.

● Banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system.
● A simple mapping : sequential interleaving.
● For example, if there are four banks,
● bank 0 has all blocks whose address modulo 4 is 0,
● bank 1 has all blocks whose address modulo 4 is 1, and so on.
Figure 2.6: Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get the byte address.
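A one-function sketch of the sequential interleaving just described, following the figure's assumption of 64-byte blocks and four banks:

#include <stdint.h>

#define NUM_BANKS  4                /* four-way interleaving, as in Figure 2.6 */
#define BLOCK_SIZE 64               /* bytes per block */

/* Sequential interleaving: consecutive block addresses map to consecutive
   banks, so bank 0 holds blocks 0, 4, 8, ..., bank 1 holds 1, 5, 9, ..., etc. */
static inline unsigned bank_of(uint64_t byte_addr)
{
    uint64_t block_addr = byte_addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}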

● Multiple banks also are a way to reduce power consumption both in caches and DRAM.
6.Critical Word First and Early
Restart to Reduce Miss Penalty
● Processor normally needs just one word of the block at a time.

● Don’t wait for the entire block to be loaded for restarting the processor.

● Early restart -
● Fetch the words in normal order, but
● as soon as the requested word of the block arrives send it to the processor
● and let the processor continue execution.
● The L2 controller is not involved in this technique.
● Critical word first -
● Request the missed word first from memory and
● send it to the processor as soon as it arrives;
● let the processor continue execution while filling the rest of the words in the block.
● The L2 cache controller forwards the words of a block out of order;
● the L1 cache controller should rearrange the words within the block.
● These techniques in general benefit designs with large cache
blocks.

● Spatial locality: there is a good chance that the next reference is to the rest of the block.

● Miss penalty is not simple to calculate.


– When there is a second request in critical word first, the effective miss penalty is the
nonoverlapped time from the reference until the second piece arrives.

● The benefits of critical word first and early restart depend on the
size of the block and the likelihood of another access to the
portion of the block that has not yet been fetched.
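A tiny sketch of the wrap-around return order used by critical word first; the 8-word block is an illustrative size:

#include <stdio.h>

#define WORDS_PER_BLOCK 8           /* e.g., a 64-byte block of 8-byte words */

/* Prints the order in which memory returns the words of a block when the
   miss was to word 'critical': the requested word first, then wrap around. */
void print_return_order(int critical)
{
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");                   /* critical = 5 prints: 5 6 7 0 1 2 3 4 */
}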
7. Merging Write Buffer to Reduce Miss
Penalty

Write-through caches use write buffers to send data to the lower level of the hierarchy.

Write-back caches use a simple buffer when a block is replaced.

If the write buffer is empty:
the data and the full address are written in the buffer,
the write is finished from the processor's perspective,
and the processor continues working while the write buffer writes to memory.

Write merging: when performing a write to a block that is already pending in the write buffer, update that write buffer entry.



Ex: the Intel Core i7 uses write merging.


Reduces stalls due to full write buffer.


Multiword writes are usually faster than writes performed one word at a time.
Figure 2.7 shows a write buffer without and with write merging. Assume four entries in the write buffer, with each entry holding four 64-bit words.
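A sketch of the merging check in C, sized to match the four-entry, four-word buffer assumed for Figure 2.7; the field names are hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES         4           /* write-buffer entries */
#define WORDS_PER_ENTRY 4           /* four 64-bit words per entry */
#define ENTRY_BYTES     (WORDS_PER_ENTRY * 8)

struct wb_entry {
    bool     valid;
    uint64_t block_addr;            /* address aligned to ENTRY_BYTES */
    uint64_t data[WORDS_PER_ENTRY];
    uint8_t  word_valid;            /* bitmap of words holding pending writes */
};

/* Returns true if the write was absorbed (merged into a matching entry or
   placed in a free one); false means the buffer is full and the processor
   must stall until an entry drains to memory. */
bool write_buffer_put(struct wb_entry buf[ENTRIES], uint64_t addr, uint64_t value)
{
    uint64_t block = addr & ~(uint64_t)(ENTRY_BYTES - 1);
    unsigned word  = (unsigned)((addr % ENTRY_BYTES) / 8);

    /* Write merging: the new write falls in a block already pending here. */
    for (int i = 0; i < ENTRIES; i++) {
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].data[word]  = value;
            buf[i].word_valid |= (uint8_t)(1u << word);
            return true;
        }
    }
    /* No match: take a free entry if one exists. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!buf[i].valid) {
            buf[i].valid      = true;
            buf[i].block_addr = block;
            buf[i].data[word] = value;
            buf[i].word_valid = (uint8_t)(1u << word);
            return true;
        }
    }
    return false;                   /* full buffer: stall */
}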
8. Compiler Optimizations to Reduce
Miss Rate
● Loop Interchange
● Blocking
Loop Interchange

● Nested loops may access data in nonsequential order.
● Swap the nested loops to access the data in sequential order.
● Ex: x is a two-dimensional array of size [5000,100]

● Reduces misses by improving spatial locality;
● reordering maximizes use of the data in a cache block before it is discarded.
● Original code: skips through memory in strides of 100 words.
● Revised version: accesses all the words in one cache block before going to the next block (see the sketch below).

● Improves cache performance without affecting the number of instructions executed.
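A sketch of the loop-interchange example above, assuming C's row-major storage for the 5000 × 100 array; the doubling in the loop body is just an illustrative operation:

#define ROWS 5000
#define COLS 100

double x[ROWS][COLS];

/* Before: the inner loop walks down a column, striding 100 words through
   memory on every iteration, so each access touches a different block. */
void scale_before(void)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row, using every word of a
   cache block before moving on. Same instruction count, better spatial locality. */
void scale_after(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}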
Blocking

● Instead of operating on entire rows or columns, subdivide the matrices into blocks.

● Requires more accesses but improves locality of accesses

● The goal is to maximize accesses to the data loaded into the cache before the data are replaced.

● This optimization improves temporal locality to reduce misses.


(Figure annotations: white – not yet touched; lighter shade – older accesses; dark – newer accesses.)

Elements of y and z are read repeatedly to calculate new elements of x.
If the cache can hold one N×N matrix and one row of N, then at least the ith row of y and the array z may stay in the cache.
Worst case: 2N³ + N² memory accesses for N³ operations.
With blocking factor B: 2N³/B + N² accesses, an improvement by a factor of about B.
Blocking exploits locality: y benefits from spatial locality, z benefits from temporal locality.
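A sketch of the blocked matrix multiply x = y × z summarized above; N and the blocking factor B are illustrative values, and B should be chosen so the working tiles fit in the cache:

#define N 512                       /* illustrative matrix dimension */
#define B 32                        /* blocking factor: tune so tiles fit in cache */

/* Static arrays are zero-initialized, which the blocked version relies on
   because it accumulates into x. */
static double x[N][N], y[N][N], z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

/* Unblocked: streams through all of z for every row of x, giving on the
   order of 2N^3 + N^2 memory words accessed for N^3 operations. */
void matmul(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked: operates on B-wide strips so the touched parts of y and z stay
   resident, cutting accesses to roughly 2N^3/B + N^2. */
void matmul_blocked(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}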
9.Hardware Prefetching of Instructions
and Data to Reduce Miss Penalty or Miss
Rate
● Nonblocking caches effectively reduce the miss penalty by overlapping execution with memory
access.

● Another approach is to prefetch items before the processor requests them.

● Both instructions and data can be prefetched, either directly into the caches or into an external buffer
that can be more quickly accessed than main memory.

● Instruction prefetch is frequently done in hardware outside of the cache.

● Typically, the processor fetches two blocks on a miss: the requested block and the next
consecutive block.

● The requested block is placed in the instruction cache when it returns, and the prefetched block is
placed into the instruction stream buffer.

● If the requested block is present in the instruction stream buffer, the original cache request is
canceled, the block is read from the stream buffer, and the next prefetch request is issued.

● A similar approach can be applied to data accesses
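A software model of the next-block instruction-prefetch policy described above, with a single-entry stream buffer; the structure is a sketch for illustration, since the real mechanism lives in hardware:

#include <stdbool.h>
#include <stdint.h>

/* One-entry stream buffer holding the prefetched next-consecutive block. */
struct stream_buffer {
    bool     valid;
    uint64_t block_addr;
};

/* Models the policy on an instruction-cache miss for 'block'. Returns true
   if the block was found in the stream buffer (the original cache request is
   canceled and the block is taken from the buffer), false if it must be
   fetched from the next level. Either way, the next consecutive block is
   prefetched into the stream buffer. */
bool icache_miss(struct stream_buffer *sb, uint64_t block)
{
    bool hit_in_buffer = sb->valid && sb->block_addr == block;

    /* Prefetch the next consecutive block into the stream buffer. */
    sb->valid      = true;
    sb->block_addr = block + 1;

    return hit_in_buffer;
}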


10. Compiler-Controlled Prefetching to
Reduce Miss Penalty or Miss Rate
● An alternative to hardware prefetching is for the compiler to
insert prefetch instructions to request data before the
processor needs it.

● There are two flavors of prefetch:
● Register prefetch loads the data into a register.
● Cache prefetch loads the data into the cache.

● Use loop unrolling and scheduling to prefetch data for adjacent iterations.
● A normal load instruction could be considered a “faulting register prefetch
instruction.”
● Nonfaulting prefetches simply turn into no-ops if they would normally result
in an exception, which is what we want.

● Prefetching makes sense only if the processor can proceed while prefetching the data; that is, the caches do not stall but continue to supply instructions and data while waiting for the prefetched data to return.

● The goal is to overlap execution with the prefetching of data.

● If the miss penalty is small, the compiler just unrolls the loop once or twice,
and it schedules the prefetches with the execution.
● If the miss penalty is large, it uses software pipelining or unrolls many times
to prefetch data for a future iteration.
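A sketch of compiler-style prefetching using the GCC/Clang __builtin_prefetch intrinsic, which maps to a nonfaulting cache-prefetch instruction on targets that have one; the prefetch distance of 16 iterations is an assumption that would be tuned against the actual miss penalty:

#define PREFETCH_DIST 16            /* assumed distance, in iterations */

/* Sums an array while prefetching the data needed PREFETCH_DIST iterations
   ahead, so the prefetches overlap with the work on the current elements. */
double sum(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], /* rw = */ 0, /* locality = */ 1);
        s += a[i];
    }
    return s;
}

With a larger miss penalty, the compiler would unroll the loop and prefetch further ahead, as described above.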
Cache Optimization Summary

The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect the other components of the average memory access time equation, as well as the complexity of the memory hierarchy.
