Unit 3 - LM11 - Memory Prefetching
Prefetching
Even programs with good data locality will now and then have to access a cache line that is not
in the cache, and will then stall until the data has been fetched from main memory. It would of
course be better if there were a way to load the data into the cache before it is needed, so the stall
could be avoided. This is called prefetching, and there are two ways to achieve it: software
prefetching and hardware prefetching.
Software Prefetching
With software prefetching the programmer or compiler inserts prefetch instructions into the
program. These are instructions that initiate a load of a cache line into the cache, but do not stall
waiting for the data to arrive.
A critical property of a prefetch instruction is its prefetch distance: the time from when the
prefetch is executed to when the data is used. If the prefetch is too close to the instruction using
the prefetched data, the cache line will not have had time to arrive from main memory or the next
cache level, and the instruction will stall anyway. This reduces the effectiveness of the prefetch.
If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache
line will instead already have been evicted before the data is actually used. The instruction
using the data will then cause another fetch of the cache line and have to stall. This not only
eliminates the benefit of the prefetch instruction, but introduces additional costs, since the cache
line is now fetched twice from main memory or the next cache level. This increases the memory
bandwidth requirement of the program.
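
To make this concrete, here is a minimal sketch in C of a loop with an explicit prefetch
distance, using the __builtin_prefetch intrinsic provided by GCC and Clang. The function name
and the distance of 16 elements are illustrative assumptions, not recommendations: a good
distance depends on the memory latency and the work per iteration, and is usually found by
measurement.

    #include <stddef.h>

    /* Illustrative prefetch distance: 16 doubles = 128 bytes, i.e. two
       64-byte cache lines ahead. Too small and the data arrives late;
       too large and it may be evicted again before use. */
    #define PREFETCH_DISTANCE 16

    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Start loading a line needed in a future iteration; the
               prefetch itself does not stall waiting for the data.
               Arguments: 0 = prefetch for reading, 3 = high locality. */
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
            sum += a[i];
        }
        return sum;
    }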
Processors that have multiple levels of caches often have different prefetch instructions for
prefetching data into different cache levels. This can be used, for example, to prefetch data from
main memory to the L2 cache far ahead of the use with an L2 prefetch instruction, and then
prefetch data from the L2 cache to the L1 cache just before the use with an L1 prefetch
instruction.
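
On x86 this two-level scheme can be sketched with the _mm_prefetch intrinsic, whose
_MM_HINT_T1 hint targets the L2 cache and outward while _MM_HINT_T0 targets all cache
levels including L1. The distances below are illustrative assumptions, and exactly which level
each hint fills is microarchitecture-dependent.

    #include <xmmintrin.h>  /* _mm_prefetch on x86 */
    #include <stddef.h>

    #define L2_DISTANCE 64  /* far ahead: hide main memory latency */
    #define L1_DISTANCE 8   /* just ahead: hide L2-to-L1 latency   */

    double sum_two_level(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Far-ahead prefetch toward the L2 cache. */
            _mm_prefetch((const char *)&a[i + L2_DISTANCE], _MM_HINT_T1);
            /* Near prefetch into the L1 cache. Bounds checks are
               omitted for brevity; the final iterations prefetch
               harmlessly past the end of the array. */
            _mm_prefetch((const char *)&a[i + L1_DISTANCE], _MM_HINT_T0);
            sum += a[i];
        }
        return sum;
    }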
There is a cost for executing a prefetch instruction. The instruction has to be decoded and it uses
some execution resources. A prefetch instruction that always prefetches cache lines that are
already in the cache will consume execution resources without providing any benefit. It is
therefore important to verify that prefetch instructions really prefetch data that is not already in
the cache.
The cache miss ratio needed by a prefetch instruction to be useful depends on its purpose. A
prefetch instruction that fetches data from main memory only needs a very low miss ratio to be
useful because of the high main memory access latency. A prefetch instruction that fetches cache
lines from a cache further from the processor to a cache closer to the processor may need a miss
ratio of a few percent to do any good.
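As a hypothetical back-of-the-envelope calculation: assume a prefetch instruction costs about
one cycle of execution resources each time it runs, and saves the full access latency whenever the
line would otherwise miss. The prefetch then breaks even when the miss ratio times the latency
saved exceeds one cycle. With a 200-cycle main memory latency this gives a break-even miss
ratio of 1/200 = 0.5%, while with a 20-cycle latency difference between two cache levels it gives
1/20 = 5%, in line with the few-percent figure above.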
It is common for software prefetching to fetch slightly more data than is actually used. For
example, when iterating over a large array it is common to prefetch some distance ahead of the
loop, say 1 kilobyte. When the loop approaches the end of the array, the software prefetching
should ideally stop. However, it is often cheaper to continue to prefetch data beyond the end of
the array than to insert additional code that checks whether the end of the array has been reached.
This means that the 1 kilobyte of data beyond the end of the array is fetched even though it is
never needed.
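
Continuing the sum_array sketch from above, dropping the bounds check gives exactly this
behaviour. GCC documents that __builtin_prefetch does not fault on invalid addresses, so
prefetching past the end is safe; the 1-kilobyte distance below is again an illustrative
assumption.

    #include <stddef.h>

    #define PF_DIST 128  /* doubles: 1 KB of data at 8 bytes each */

    double sum_unguarded(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* No end-of-array check: the final iterations prefetch up
               to 1 KB past the end. Since the prefetch cannot fault,
               the only cost is the useless fetch traffic. */
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);
            sum += a[i];
        }
        return sum;
    }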
Hardware Prefetching
Many modern processors implement hardware prefetching. This means that the processor
monitors the memory access pattern of the running program and tries to predict what data the
program will access next and prefetches that data. There are a few different variants of how this
can be done.
A stream prefetcher looks for streams where a sequence of consecutive cache lines is accessed
by the program. When such a stream is found the processor starts prefetching the cache lines
ahead of the program's accesses.
A stride prefetcher looks for instructions that make accesses with a regular stride, which does
not necessarily have to touch consecutive cache lines. When such an instruction is detected the
processor tries to prefetch the cache lines it will access ahead of it.
An adjacent cache line prefetcher automatically fetches adjacent cache lines to ones being
accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a
cache level without actually having to increase the line size.
Hardware prefetchers can generally only handle very regular access patterns. The cost of
prefetching data that isn't used can be high, so processor designers have to be conservative.
3. Software Prefetching
3.1. Manual Prefetching
Manual prefetching involves the programmer explicitly inserting prefetch instructions into the
code. These instructions are hints to the processor to load specific data into the cache ahead of
time. This approach can be highly effective when the programmer has deep knowledge of the
application's access patterns.
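
A pattern where manual prefetching pays off is an indirect (gather) access, which hardware
prefetchers typically cannot predict because the data addresses depend on loaded index values.
The programmer, however, knows that the index array is scanned sequentially and can look
ahead in it. The names and the distance in this sketch are illustrative assumptions.

    #include <stddef.h>

    #define DIST 32  /* illustrative look-ahead distance in elements */

    double gather_sum(const double *data, const size_t *index, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* The index array is sequential, so we can read ahead in
               it and prefetch the scattered data element it points to
               well before that element is needed. */
            if (i + DIST < n)
                __builtin_prefetch(&data[index[i + DIST]], 0, 1);
            sum += data[index[i]];
        }
        return sum;
    }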
3.2. Compiler-based Prefetching
Compilers can automatically insert prefetch instructions based on their analysis of the code. The
compiler looks for loops and other predictable patterns to determine where prefetching could be
beneficial. This method reduces the burden on the programmer and can optimize for specific
hardware characteristics.
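
With this approach the source code stays plain. For example, GCC's -fprefetch-loop-arrays
option requests prefetch generation for loops like the one below; it is effective only on targets
that have prefetch instructions, and the example function name is an illustrative assumption.

    /* An ordinary loop with no explicit prefetching; the compiler
       inserts the prefetch instructions itself when invoked as, for
       example:

           gcc -O2 -fprefetch-loop-arrays program.c
    */
    #include <stddef.h>

    double plain_sum(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }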
4. Hardware Prefetching
4.1. Next-line Prefetching
Next-line prefetching is one of the simplest forms of hardware prefetching. When a cache miss
occurs, the cache controller not only fetches the requested cache line but also the next sequential
line. This technique is effective for workloads with sequential memory access patterns.
4.2. Stride Prefetching
Stride prefetching involves detecting regular access patterns with a fixed stride. For example, if a
program repeatedly accesses memory addresses in the sequence A, A+S, A+2S, and so on, the
prefetcher detects the stride S and prefetches A+3S and later addresses before the program
reaches them.
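
A simplified software model of such a stride detector, often called a reference prediction table, is
sketched below. It is indexed by the load instruction's address (its PC) and tracks the last data
address, the last stride, and a confidence counter. The table size and thresholds are illustrative
assumptions; real hardware uses small associative structures rather than this direct-mapped array.

    #include <stdint.h>
    #include <stdbool.h>

    #define RPT_SIZE 256

    struct rpt_entry {
        uint64_t pc;         /* load instruction address (tag) */
        uint64_t last_addr;  /* last data address accessed     */
        int64_t  stride;     /* last observed stride           */
        int      confidence; /* saturating counter             */
    };

    static struct rpt_entry rpt[RPT_SIZE];

    /* Called on every load; returns true and sets *prefetch_addr when
       the detected stride is confident enough to prefetch ahead. */
    bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr)
    {
        struct rpt_entry *e = &rpt[pc % RPT_SIZE];

        if (e->pc != pc) {          /* new load: (re)allocate entry */
            e->pc = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->confidence = 0;
            return false;
        }

        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride == e->stride && stride != 0) {
            if (e->confidence < 3)  /* same stride again: more confident */
                e->confidence++;
        } else {
            e->confidence = 0;      /* stride changed: start over */
            e->stride = stride;
        }
        e->last_addr = addr;

        if (e->confidence >= 2) {   /* prefetch one stride ahead */
            *prefetch_addr = addr + (uint64_t)e->stride;
            return true;
        }
        return false;
    }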
4.3. Adaptive Prefetching
Adaptive prefetchers adjust their strategies based on the observed behavior of the running
program. They can switch between different prefetching techniques or modify their
aggressiveness depending on the workload characteristics. This adaptability helps in optimizing
performance across a variety of applications.
4.4. Correlation-based Prefetching
This sophisticated method involves maintaining a history of cache misses and identifying
patterns or correlations in these misses. When a cache miss occurs, the prefetcher uses this
historical data to predict and prefetch future addresses. While complex, correlation-based
prefetching can significantly boost performance for applications with recurring access patterns.
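
A simplified software model of this idea, sometimes called a Markov prefetcher, is sketched
below: a history table remembers which miss followed which, and on a repeated miss the
recorded successor is prefetched. The table size and the single-successor design are illustrative
assumptions; real designs typically track several candidate successors with confidence bits.

    #include <stdint.h>
    #include <stdbool.h>

    #define HIST_SIZE 1024

    struct hist_entry {
        uint64_t miss_addr;  /* tag: a previously seen miss address */
        uint64_t next_addr;  /* the miss that followed it last time */
        bool     valid;
    };

    static struct hist_entry hist[HIST_SIZE];
    static uint64_t prev_miss;
    static bool have_prev;

    /* Called on every cache miss; returns true and sets *prefetch_addr
       when a correlated successor for this miss is known. */
    bool markov_miss(uint64_t addr, uint64_t *prefetch_addr)
    {
        /* Record that 'addr' followed the previous miss. */
        if (have_prev) {
            struct hist_entry *e = &hist[prev_miss % HIST_SIZE];
            e->miss_addr = prev_miss;
            e->next_addr = addr;
            e->valid = true;
        }
        prev_miss = addr;
        have_prev = true;

        /* Predict: what followed this address last time? */
        struct hist_entry *e = &hist[addr % HIST_SIZE];
        if (e->valid && e->miss_addr == addr) {
            *prefetch_addr = e->next_addr;
            return true;
        }
        return false;
    }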
5. Prefetching Performance Metrics
To evaluate the effectiveness of prefetching techniques, several performance metrics are used:
Prefetch Accuracy: The ratio of useful prefetches (those that are actually used by the processor)
to the total number of prefetches issued.
Prefetch Coverage: The proportion of cache misses that are eliminated due to prefetching.
Prefetch Timeliness: The degree to which prefetched data is available in the cache when needed.
Prefetch Overhead: The additional memory traffic and cache pollution caused by prefetching.
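As a hypothetical worked example tying these metrics together: suppose a program suffers 2,000
cache misses without prefetching, and a prefetcher issues 1,000 prefetches of which 800 are
referenced before being evicted, eliminating 700 of the original misses. Its accuracy is then
800 / 1,000 = 80%, and its coverage is 700 / 2,000 = 35%. The remaining 200 unused prefetches
are pure overhead: extra memory traffic and possible cache pollution.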
6. Challenges in Prefetching
6.1. Prefetching Overhead
While prefetching can improve performance, it also introduces overhead. Unnecessary prefetches
can lead to increased memory traffic and cache pollution, where useful data is evicted from the
cache prematurely.
6.2. Prefetching Timeliness
For prefetching to be effective, the prefetched data must be in the cache when the processor
needs it. If data is prefetched too early, it might be evicted before use. If prefetched too late, it
does not help in reducing latency.
6.3. Adaptability
Different applications have different memory access patterns. A prefetching strategy that works
well for one application might not be effective for another. Prefetchers need to be adaptable to
various workloads.
6.4. Hardware Complexity
Implementing advanced prefetching techniques adds complexity to the hardware. This can lead
to increased power consumption and design challenges.
7. Case Studies and Examples
7.1. Sequential Prefetching in Multimedia Applications
Multimedia applications, such as video processing and image rendering, often access memory in
a sequential manner. Sequential prefetching can be highly effective in such scenarios. By
prefetching the next blocks of data, the application can maintain a smooth and uninterrupted data
flow.
7.2. Stride Prefetching in Scientific Computing
Scientific applications frequently involve array processing with regular strides. Stride
prefetching can significantly enhance performance by prefetching future elements of the array
based on the detected stride pattern. This reduces the wait time for data fetch operations and
boosts computational efficiency.
7.3. Adaptive Prefetching in General-purpose Computing
General-purpose applications exhibit a wide range of memory access patterns. Adaptive
prefetchers, which can switch between different strategies, are particularly useful in such
environments. By continuously monitoring access patterns and adjusting prefetching strategies,
adaptive prefetchers optimize performance across diverse workloads.
8. Prefetching in Modern Processors
Modern processors, including those from Intel and AMD, incorporate sophisticated hardware
prefetching mechanisms. These processors use a combination of next-line prefetching, stride
prefetching, and adaptive techniques to optimize memory access latency. Additionally, software
developers can leverage compiler-based prefetching to further enhance application performance.
9. Future Directions in Prefetching
As technology advances, new prefetching techniques and enhancements continue to emerge.
Some potential future directions include:
9.1. Machine Learning-based Prefetching
Machine learning algorithms can analyze memory access patterns and predict future accesses
with high accuracy. By training models on historical data, processors can implement highly
effective and adaptive prefetching strategies.
9.2. Prefetching in Non-volatile Memory Systems
With the advent of non-volatile memory (NVM) technologies, new challenges and opportunities
arise for prefetching. Prefetching strategies need to be adapted to the unique characteristics of
NVM, such as higher write latencies and endurance limitations.