
Feedback Directed Prefetching:

Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers


Santhosh Srinath†‡   Onur Mutlu§   Hyesoon Kim‡   Yale N. Patt‡
†Microsoft   §Microsoft Research   ‡Department of Electrical and Computer Engineering, The University of Texas at Austin
[email protected]   [email protected]   {santhosh, hyesoon, patt}@ece.utexas.edu

Abstract

High performance processors employ hardware data prefetching to reduce the negative performance impact of large main memory latencies. While prefetching improves performance substantially on many programs, it can significantly reduce performance on others. Also, prefetching can significantly increase memory bandwidth requirements. This paper proposes a mechanism that incorporates dynamic feedback into the design of the prefetcher to increase the performance improvement provided by prefetching as well as to reduce the negative performance and bandwidth impact of prefetching. Our mechanism estimates prefetcher accuracy, prefetcher timeliness, and prefetcher-caused cache pollution to adjust the aggressiveness of the data prefetcher dynamically. We introduce a new method to track cache pollution caused by the prefetcher at run-time. We also introduce a mechanism that dynamically decides where in the LRU stack to insert the prefetched blocks in the cache, based on the cache pollution caused by the prefetcher.

Using the proposed dynamic mechanism improves average performance by 6.5% on 17 memory-intensive benchmarks in the SPEC CPU2000 suite compared to the best-performing conventional stream-based data prefetcher configuration, while consuming 18.7% less memory bandwidth. Compared to a conventional stream-based data prefetcher configuration that consumes a similar amount of memory bandwidth, feedback directed prefetching provides 13.6% higher performance. Our results show that feedback directed prefetching eliminates the large negative performance impact incurred on some benchmarks due to prefetching, and that it is applicable to stream-based prefetchers, global-history-buffer-based delta correlation prefetchers, and PC-based stride prefetchers.

1. Introduction

Hardware data prefetching works by predicting the memory access pattern of the program and speculatively issuing prefetch requests to the predicted memory addresses before the program accesses those addresses. Prefetching has the potential to improve performance if the memory access pattern is correctly predicted and the prefetch requests are initiated early enough before the program accesses the predicted memory addresses. Since the memory latencies faced by today's processors are on the order of hundreds of processor clock cycles, accurate and timely prefetching of data from main memory to the processor caches can lead to significant performance gains by hiding the latency of memory accesses. On the other hand, prefetching can negatively impact the performance and energy consumption of a processor for two major reasons, especially if the predicted memory addresses are not accurate:

• First, prefetching can increase the contention for the available memory bandwidth. Additional bandwidth contention caused by prefetches can lead to increased DRAM bank conflicts, DRAM page conflicts, memory bus contention, and queueing delays. This can significantly reduce performance if it results in delaying demand (i.e. load/store) requests. Moreover, inaccurate prefetches increase the energy consumption of the processor because they result in unnecessary memory accesses (i.e. they waste memory/bus bandwidth). Bandwidth contention due to prefetching will become more significant as more and more processing cores are integrated onto the same die in chip multiprocessors, effectively reducing the memory bandwidth available to each core. Therefore, techniques that reduce the memory bandwidth consumption of hardware prefetchers while maintaining their performance improvement will become more desirable and valuable in future processors [22].

• Second, prefetching can cause cache pollution if the prefetched data displaces cache blocks that will later be needed by load/store instructions in the program.^1 Cache pollution due to prefetching might not only reduce performance but also waste memory bandwidth by resulting in additional cache misses. Furthermore, prefetcher-caused cache pollution generates new cache misses, and those cache misses can in turn generate new prefetch requests. Hence, the prefetcher itself is a positive feedback system that can be unstable in terms of both performance and bandwidth consumption. Therefore, we would like to augment the prefetcher with a negative feedback system to make it stable.

Figure 1 compares the performance of varying the aggressiveness of a stream-based hardware data prefetcher from No prefetching to Very Aggressive prefetching on 17 memory-intensive benchmarks in the SPEC CPU2000 benchmark suite.^2 Aggressive prefetching improves IPC performance by 84% on average^3 and by over 800% for some benchmarks (e.g. mgrid) compared to no prefetching. Furthermore, aggressive prefetching on average performs better than conservative and middle-of-the-road prefetching.

^1 Note that this is a problem only in designs where prefetch requests bring data into processor caches rather than into separate prefetch buffers [13, 11]. In many current processors (e.g. Intel Pentium 4 [6] or IBM POWER4 [24]), prefetch requests bring data into the processor caches. This reduces the complexity of the memory system by eliminating the need to design a separate prefetch buffer. It also makes the large L2 cache space available to prefetch requests, enabling the prefetched blocks and demand-fetched blocks to share the available cache memory dynamically rather than statically partitioning the storage space for demand-fetched and prefetched data.
^2 The aggressiveness of the prefetcher is determined by how far the prefetcher stays ahead of the demand access stream of the program as well as by how many prefetch requests are generated, as shown in Table 1 and Section 2.1.
^3 Similar results were reported by [8] and [18]. All average IPC results in this paper are computed as the geometric mean of the IPCs of the benchmarks.
Unfortunately, aggressive prefetching significantly reduces performance on some benchmarks. For example, an aggressive prefetcher reduces the IPC performance of ammp by 48% and of applu by 29% compared to no prefetching. Hence, blindly increasing the aggressiveness of the hardware prefetcher can drastically reduce performance on several applications even though it improves the average performance of a processor. Since aggressive prefetching significantly degrades performance on some benchmarks, many modern processors employ relatively conservative prefetching mechanisms in which the prefetcher does not stay far ahead of the demand access stream of the program [6, 24].

Figure 1. Performance vs. aggressiveness of the prefetcher (IPC of each benchmark under No prefetching, Very Conservative, Middle-of-the-Road, and Very Aggressive configurations)

The goal of this paper is to reduce the negative performance and bandwidth impact of aggressive prefetching while preserving the large performance benefits provided by it. To achieve this goal, we propose simple and implementable mechanisms that dynamically adjust the aggressiveness of the hardware prefetcher as well as the location in the processor cache where prefetched data is inserted.

The proposed mechanisms estimate the effectiveness of the prefetcher by monitoring the accuracy and timeliness of the prefetch requests as well as the cache pollution caused by the prefetch requests. We describe simple hardware implementations to estimate accuracy, timeliness, and cache pollution. Based on the run-time estimates of these three metrics, the aggressiveness of the hardware prefetcher is decreased or increased dynamically. Also, based on the run-time estimate of the cache pollution caused by the prefetcher, the proposed mechanism dynamically decides where to insert the prefetched blocks in the processor cache's LRU stack.

Our results show that the proposed dynamic feedback mechanisms improve the average performance of 17 memory-intensive benchmarks in the SPEC CPU2000 suite by 6.5% compared to the best-performing conventional stream-based prefetcher configuration. With the proposed mechanism, the negative performance impact incurred on some benchmarks due to stream-based prefetching is completely eliminated. Furthermore, the proposed mechanism consumes 18.7% less memory bandwidth than the best-performing stream-based prefetcher configuration. Compared to a conventional stream-based prefetcher configuration that consumes a similar amount of memory bandwidth, feedback directed prefetching provides 13.6% higher performance. We also show that the dynamic feedback mechanism works similarly well when implemented to dynamically adjust the aggressiveness of a global-history-buffer (GHB) based delta correlation prefetcher [10] or a PC-based stride prefetcher [1]. Compared to a conventional GHB-based delta correlation prefetcher configuration that consumes a similar amount of memory bandwidth, feedback directed prefetching provides 9.9% higher performance. The proposed mechanism provides these benefits with a modest hardware storage cost of 2.54 KB and without significantly increasing hardware complexity. On the remaining 9 SPEC CPU2000 benchmarks, the proposed dynamic feedback mechanism performs as well as the best-performing conventional stream prefetcher configuration for those benchmarks.

2. Background and Motivation

2.1. Stream Prefetcher Design

The stream prefetcher we model is based on the stream prefetcher in the IBM POWER4 processor [24]; more details on the implementation of stream-based prefetching can be found in [11, 19, 24]. The modeled prefetcher brings cache blocks from main memory into the last-level cache, which is the second-level (L2) cache in our baseline processor.

The stream prefetcher is able to keep track of multiple different access streams. For each tracked access stream, a stream tracking entry is created in the stream prefetcher. Each tracking entry can be in one of four different states:

1. Invalid: The tracking entry is not allocated a stream to keep track of. Initially, all tracking entries are in this state.

2. Allocated: A demand (i.e. load/store) L2 miss allocates a tracking entry if the demand miss does not find any existing tracking entry for its cache-block address.

3. Training: The prefetcher trains the direction (ascending or descending) of the stream based on the next two L2 misses that occur within +/- 16 cache blocks of the first miss.^4 If the next two accesses in the stream are to ascending (descending) addresses, the direction of the tracking entry is set to 1 (0) and the entry transitions to the Monitor and Request state.

4. Monitor and Request: The tracking entry monitors the accesses to a memory region from a start pointer (address A) to an end pointer (address P). The maximum distance between the start pointer and the end pointer is determined by the Prefetch Distance, which indicates how far ahead of the demand access stream the prefetcher can send requests. If there is a demand L2 cache access to a cache block in the monitored memory region, the prefetcher requests cache blocks [P+1, ..., P+N] as prefetch requests (assuming the direction of the tracking entry is set to 1). N is called the Prefetch Degree. After sending the prefetch requests, the tracking entry starts monitoring the memory region between addresses A+N and P+N (i.e. it effectively moves the tracked memory region forward by N cache blocks).^5 A minimal software sketch of this state machine is given after the footnotes below.

^4 Note that all addresses tracked by the prefetcher are cache-block addresses.
^5 Right after a tracking entry is trained, the prefetcher sets the start pointer to the first L2 miss address that allocated the tracking entry and the end pointer to the last L2 miss address that determined the direction of the entry plus an initial start-up distance. Until the monitored memory region's size becomes the same as the Prefetch Distance (in terms of cache blocks), the tracking entry increments only the end pointer by the Prefetch Degree when prefetches are issued (i.e. the end pointer points to the last address requested as a prefetch and the start pointer points to the L2 miss address that allocated the tracking entry). After the monitored memory region's size becomes the same as the Prefetch Distance, both the start pointer and the end pointer are incremented by the Prefetch Degree (N) when prefetches are issued. This way, the prefetcher is able to send prefetch requests that are Prefetch Distance ahead of the demand access stream.
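To make the tracking-entry life cycle concrete, the following sketch models one entry in software. It is a minimal illustration rather than the paper's hardware: addresses are cache-block addresses, only an ascending stream is modeled, the initial start-up distance is assumed equal to the Prefetch Degree, and all identifiers (StreamEntry, TRAIN_WINDOW, demand_access) are our own.

# A minimal software model of one stream tracking entry (illustrative sketch,
# not the paper's hardware). All addresses are cache-block addresses and only
# an ascending stream is modeled for brevity; names are ours.
INVALID, ALLOCATED, TRAINING, MONITOR = range(4)
TRAIN_WINDOW = 16   # training misses must fall within +/-16 blocks of the first

class StreamEntry:
    def __init__(self):
        self.state = INVALID

    def demand_access(self, addr, distance, degree):
        """Feed a demand L2 access (a miss, until the entry is trained).
        Returns the list of cache blocks to prefetch, possibly empty."""
        if self.state == INVALID:                  # Invalid -> Allocated on a miss
            self.first_miss, self.confirmations = addr, 0
            self.state = ALLOCATED
        elif self.state in (ALLOCATED, TRAINING):  # train on the next two misses
            if 0 < addr - self.first_miss <= TRAIN_WINDOW:
                self.confirmations += 1
                self.state = TRAINING
                if self.confirmations == 2:        # direction confirmed: ascending
                    self.start = self.first_miss   # start pointer (address A)
                    self.end = addr + degree       # end pointer (P) + assumed start-up distance
                    self.state = MONITOR
        elif self.start <= addr <= self.end:       # Monitor and Request
            prefetches = list(range(self.end + 1, self.end + degree + 1))
            self.end += degree                     # always advance the end pointer
            if self.end - self.start > distance:   # region capped at Prefetch Distance:
                self.start = self.end - distance   # now the start pointer slides too
            return prefetches
        return []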
Prefetch Distance and Prefetch Degree determine the aggressiveness of the prefetcher. In a traditional prefetcher configuration, the values of Prefetch Distance and Prefetch Degree are fixed at the design time of the processor. In the feedback directed mechanism we propose, the processor dynamically changes Prefetch Distance and Prefetch Degree to adjust the aggressiveness of the prefetcher.

2.2. Metrics of Prefetcher Effectiveness

We use three metrics (Prefetch Accuracy, Prefetch Lateness, and Prefetcher-Generated Cache Pollution) as feedback inputs to feedback directed prefetchers. In this section, we define the metrics and describe the relationship between the metrics and the performance provided by a conventional prefetcher. We evaluate four configurations: No prefetching, Very Conservative prefetching (distance=4, degree=1), Middle-of-the-Road prefetching (distance=16, degree=2), and Very Aggressive prefetching (distance=64, degree=4).

2.2.1. Prefetch Accuracy: Prefetch accuracy is a measure of how accurately the prefetcher can predict the memory addresses that will be accessed by the program. It is defined as

    Prefetch Accuracy = Number of Useful Prefetches / Number of Prefetches Sent To Memory

where Number of Useful Prefetches is the number of prefetched cache blocks that are used by demand requests while they are resident in the L2 cache.

Figure 2 shows the IPC of the four configurations along with the prefetch accuracy measured over the entire run of each benchmark. The results show that in benchmarks where prefetch accuracy is less than 40% (applu, galgel, and ammp), employing the stream prefetcher always degrades performance compared to no prefetching. In all benchmarks where prefetch accuracy exceeds 40% (except mcf), using the stream prefetcher significantly improves performance over no prefetching. For benchmarks with high prefetch accuracy, performance increases as the aggressiveness of the prefetcher is increased. Hence, the performance improvement provided by increasing the aggressiveness of the prefetcher is correlated with prefetch accuracy.

2.2.2. Prefetch Lateness: Prefetch lateness is a measure of how timely the prefetch requests generated by the prefetcher are with respect to the demand accesses that need the prefetched data. A prefetch is defined to be late if the prefetched data has not yet returned from main memory by the time a load or store instruction requests it. Even if the prefetch requests are accurate, a prefetcher might not be able to improve performance if the prefetch requests are very late. We define prefetch lateness as:

    Prefetch Lateness = Number of Late Prefetches / Number of Useful Prefetches

Figure 3 shows the IPC of the four configurations along with the prefetch lateness measured over the entire run of each program. These results explain why prefetching does not provide significant performance benefit on mcf, even though the prefetch accuracy is close to 100%: more than 90% of the useful prefetch requests are late in mcf. In general, prefetch lateness decreases as the prefetcher becomes more aggressive. For example, in vortex, prefetch lateness decreases from 70% to 22% when a very aggressive prefetcher is used instead of a very conservative one. Aggressive prefetching reduces the lateness of prefetches because an aggressive prefetcher generates prefetch requests earlier than a conservative one would.

2.2.3. Prefetcher-Generated Cache Pollution: Prefetcher-generated cache pollution is a measure of the disturbance caused by prefetched data in the L2 cache. It is defined as:

    Prefetcher-Generated Cache Pollution = Number of Demand Misses Caused By the Prefetcher / Number of Demand Misses

A demand miss is defined to be caused by the prefetcher if it would not have occurred had the prefetcher not been present. If the prefetcher-generated cache pollution is high, the performance of the processor can degrade because useful data in the cache could be evicted by prefetched data. Furthermore, high cache pollution can also result in higher memory bandwidth consumption by requiring the re-fetch of the displaced data from main memory.

3. Feedback Directed Prefetching (FDP)

FDP dynamically adapts the aggressiveness of the prefetcher based on the accuracy, lateness, and pollution metrics defined in the previous section. This section describes the hardware mechanisms that track these metrics and the FDP mechanism itself.

3.1. Collecting Feedback Information

3.1.1. Prefetch Accuracy: To track the usefulness of prefetch requests, we add a bit (pref-bit) to each tag-store entry in the L2 cache.^6 When a prefetched block is inserted into the cache, the pref-bit associated with that block is set. Prefetcher accuracy is tracked using two hardware counters. The first counter, pref-total, tracks the number of prefetches sent to memory. The second counter, used-total, tracks the number of useful prefetches. When a prefetch request is sent to memory, pref-total is incremented. When an L2 cache block that has the pref-bit set is accessed by a demand request, the pref-bit is reset and used-total is incremented. The accuracy of the prefetcher is computed as the ratio of used-total to pref-total.

3.1.2. Prefetch Lateness: The Miss Status Holding Register (MSHR) [12] is a hardware structure that keeps track of all in-flight memory requests. Before allocating an MSHR entry for a request, the MSHR checks whether the requested cache block is already being serviced by an earlier memory request. Each entry in the L2 cache MSHR has a bit, called the pref-bit, which indicates that the memory request was generated by the prefetcher. A prefetch request is late if a demand request for the prefetched address is generated while the prefetch request is still in the MSHR waiting for main memory. We use a hardware counter, late-total, to keep track of such late prefetches. If a demand request hits an MSHR entry that has its pref-bit set, the late-total counter is incremented, and the pref-bit associated with that MSHR entry is reset. The lateness metric is computed as the ratio of late-total to used-total.

^6 Note that several proposed prefetching implementations, such as tagged next-sequential prefetching [5, 21], already employ pref-bits in the cache.
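The counter updates of Sections 3.1.1 and 3.1.2 can be summarized in a few lines of code. The sketch below is our paraphrase: the sets stand in for the per-block pref-bits in the L2 tag store and the pref-bits carried by MSHR entries, and we additionally count a late prefetch as used so that the lateness ratio stays between 0 and 1 (a detail the text leaves implicit). All names are ours.

# Illustrative model of the bookkeeping in Sections 3.1.1-3.1.2 (names ours).
class FeedbackCounters:
    def __init__(self):
        self.pref_total = 0      # prefetches sent to memory
        self.used_total = 0      # prefetched blocks later touched by demand requests
        self.late_total = 0      # demand requests that caught a prefetch in flight
        self.pref_bits = set()   # blocks resident in L2 with their pref-bit set
        self.mshr_pref = set()   # prefetch addresses still waiting on memory

    def send_prefetch(self, block):
        self.pref_total += 1
        self.mshr_pref.add(block)

    def prefetch_fill(self, block):      # prefetched data returns from memory
        self.mshr_pref.discard(block)
        self.pref_bits.add(block)        # set the pref-bit on insertion

    def demand_access(self, block):
        if block in self.mshr_pref:      # demand hit an in-flight prefetch: late
            self.late_total += 1
            self.used_total += 1         # assumption: a late prefetch is still useful
            self.mshr_pref.discard(block)
        elif block in self.pref_bits:    # first demand use of a prefetched block
            self.used_total += 1
            self.pref_bits.discard(block)  # reset the pref-bit; count it once

    def accuracy(self):                  # used-total / pref-total
        return self.used_total / self.pref_total if self.pref_total else 0.0

    def lateness(self):                  # late-total / used-total
        return self.late_total / self.used_total if self.used_total else 0.0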
Figure 2. IPC performance (left) and prefetch accuracy (right) with different aggressiveness configurations

Figure 3. IPC performance (left) and prefetch lateness (right) with different aggressiveness configurations

3.1.3. Prefetcher-Generated Cache Pollution: To track the number of demand misses caused by the prefetcher, the processor would need to store information about all demand-fetched L2 cache blocks dislodged by the prefetcher. However, such a mechanism is impractical, as it incurs a heavy overhead in terms of both hardware and complexity. We use the Bloom filter concept [2, 20] to provide a simple, cost-effective hardware mechanism that approximates the number of demand misses caused by the prefetcher.

Figure 4 shows the filter that is used to approximate the number of L2 demand misses caused by the prefetcher. The filter consists of a bit-vector, which is indexed with the output of the exclusive-or operation of the lower-order and higher-order bits of the cache-block address. When a block that was brought into the cache due to a demand miss is evicted from the cache due to a prefetch request, the filter is accessed with the address of the evicted cache block and the corresponding bit in the filter is set (indicating that the cache block was evicted due to a prefetch request). When a prefetch request is serviced from memory, the pollution filter is accessed with the cache-block address of the prefetch request and the corresponding bit in the filter is reset, indicating that the block was inserted into the cache. When a demand access misses in the cache, the filter is accessed using the cache-block address of the demand request. If the corresponding bit in the filter is set, this is an indication that the demand miss was caused by the prefetcher. In such cases, the hardware counter pollution-total, which keeps track of the total number of demand misses caused by the prefetcher, is incremented. Another counter, demand-total, keeps track of the total number of demand misses generated by the processor and is incremented for each demand miss. The cache pollution caused by the prefetcher is computed as the ratio of pollution-total to demand-total. We use a 4096-entry bit vector in our experiments.

Figure 4. Filter to estimate prefetcher-generated cache pollution (the filter is indexed by CacheBlockAddress[11:0] XOR CacheBlockAddress[23:12])

3.2. Sampling-based Feedback Collection

To adapt to the time-varying memory phase behavior of a program, we use interval-based sampling for all counters described in Section 3.1. Program execution time is divided into intervals and the value of each counter is computed as:

    CounterValue = 1/2 * CounterValueAtTheBeginningOfTheInterval + 1/2 * CounterValueDuringInterval    (1)

CounterValueDuringInterval is reset at the end of each sampling interval. Equation (1) gives more weight to the behavior of the program in the most recent interval while still taking into account the behavior in all previous intervals. Our mechanism defines the length of an interval based on the number of useful cache blocks evicted from the L2 cache.^7 A hardware counter, eviction-count, keeps track of the number of blocks evicted from the L2 cache. When the value of this counter exceeds a statically-set threshold T_interval, the interval ends. At the end of an interval, all counters described in Section 3.1 are updated according to Equation (1). The updated counter values are then used to compute the three metrics: accuracy, lateness, and pollution. These metrics are used to adjust the prefetcher behavior for the next interval. The eviction-count register is reset and a new interval begins. In our experiments, we use a value of 8192 (half the number of blocks in the L2 cache) for T_interval.

^7 There are other ways to define the length of an interval, e.g. based on the number of instructions executed. We use the number of useful cache blocks evicted to define an interval because this metric provides a more accurate view of the memory behavior of a program than the number of instructions executed.
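The pollution filter of Figure 4 and the interval bookkeeping of Equation (1) are compact enough to express directly. The sketch below follows the indexing described above (XOR of the low and high 12 bits of the cache-block address into a 4096-entry bit vector); the function names are ours.

# Sketch of the 4096-entry pollution filter (Figure 4) and the interval
# update of Equation (1). Names are ours.
FILTER_BITS = 4096

def filter_index(block_addr):
    # XOR of the low and high 12 bits of the cache-block address
    return (block_addr & 0xFFF) ^ ((block_addr >> 12) & 0xFFF)

pollution_filter = [0] * FILTER_BITS
pollution_total = demand_total = 0

def on_prefetch_evicts_demand_block(evicted_block):
    pollution_filter[filter_index(evicted_block)] = 1   # demand block displaced

def on_prefetch_fill(prefetched_block):
    pollution_filter[filter_index(prefetched_block)] = 0  # block now in the cache

def on_demand_miss(block):
    global pollution_total, demand_total
    demand_total += 1
    if pollution_filter[filter_index(block)]:           # likely displaced earlier
        pollution_total += 1

# Equation (1): each interval, a counter keeps half of its running value and
# adds half of the value accumulated during the interval that just ended.
def end_of_interval(running_value, value_during_interval):
    return 0.5 * running_value + 0.5 * value_during_interval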
3.3. Dynamically Adjusting Prefetcher Behavior

At the end of each sampling interval, the computed values of the accuracy, lateness, and pollution metrics are used to dynamically adjust prefetcher behavior. Prefetcher behavior is adjusted in two ways: (1) by adjusting the aggressiveness of the prefetching mechanism, and (2) by adjusting the location in the L2 cache's LRU stack where prefetched blocks are inserted.^8

3.3.1. Adjusting Prefetcher Aggressiveness: The aggressiveness of the prefetcher directly determines the potential for benefit as well as for harm caused by the prefetcher. By dynamically adapting this parameter based on the collected feedback information, the processor can not only achieve the performance benefits of aggressive prefetching during program phases where aggressive prefetching performs well, but also eliminate the negative performance and bandwidth impact of aggressive prefetching during phases where it performs poorly.

As shown in Table 1, our baseline stream prefetcher has five different configurations ranging from Very Conservative to Very Aggressive. The aggressiveness of the stream prefetcher is determined by the Dynamic Configuration Counter, a 3-bit saturating counter that saturates at the values 1 and 5. The initial value of the Dynamic Configuration Counter is set to 3, indicating Middle-of-the-Road aggressiveness.

Dyn. Config. Counter | Aggressiveness     | Pref. Distance | Pref. Degree
1                    | Very Conservative  | 4              | 1
2                    | Conservative       | 8              | 1
3                    | Middle-of-the-Road | 16             | 2
4                    | Aggressive         | 32             | 4
5                    | Very Aggressive    | 64             | 4
Table 1. Stream prefetcher configurations

At the end of each sampling interval, the value of the Dynamic Configuration Counter is updated based on the computed values of the accuracy, lateness, and pollution metrics. The computed accuracy is compared to two thresholds (A_high and A_low) and is classified as high, medium, or low. Similarly, the computed lateness is compared to a single threshold (T_lateness) and is classified as either late or not-late. Finally, the computed pollution is compared to a single threshold (T_pollution) and is classified as high (polluting) or low (not-polluting). We use static thresholds in our mechanisms. The effectiveness of our mechanism could be improved by dynamically tuning the values of these thresholds and/or using more thresholds, but such optimization is out of the scope of this paper. In Section 5, we show that even with untuned threshold values, FDP can significantly improve performance and reduce memory bandwidth consumption for different data prefetchers.

Table 2 shows in detail how the estimated values of the three metrics are used to adjust the dynamic configuration of the prefetcher. We determined the counter update choice for each case empirically. If the prefetches are causing pollution (all even-numbered cases), the prefetcher is adjusted to be less aggressive to reduce cache pollution and to save memory bandwidth (except in Case 2, when the accuracy is high and prefetches are late: we increase aggressiveness in this case to gain more benefit from highly-accurate prefetches). If the prefetches are late but not polluting (Cases 1, 5, 9), the aggressiveness is increased to improve timeliness, unless the prefetch accuracy is low (Case 9: we reduce aggressiveness in this case because a large fraction of inaccurate prefetches would waste memory bandwidth). If the prefetches are neither late nor polluting (Cases 3, 7, 11), the aggressiveness is left unchanged.

3.3.2. Adjusting Cache Insertion Policy of Prefetched Blocks: FDP also adjusts the location at which a prefetched block is inserted in the LRU stack of the corresponding cache set based on the observed behavior of the prefetcher. In many cache implementations, prefetched cache blocks are simply inserted into the Most-Recently-Used (MRU) position in the LRU stack, since such an insertion policy does not require any changes to the cache implementation. Inserting the prefetched blocks into the MRU position can allow the prefetcher to be more aggressive and request data long before its use, because this insertion policy allows useful prefetched blocks to stay longer in the cache. However, if the prefetched cache blocks create cache pollution, a different cache insertion policy for prefetched blocks can help reduce the cache pollution caused by the prefetcher. A prefetched block that is not useful creates more pollution in the cache if it is inserted into the MRU position rather than a less recently used position, because it stays in the cache for a longer time period, occupying cache space that could otherwise be allocated to a useful demand-fetched cache block. Therefore, if the prefetch requests are causing cache pollution, it is desirable to reduce this pollution by changing the location in the LRU stack at which prefetched blocks are inserted.

We propose a simple heuristic that decides where in the LRU stack of an L2 cache set a prefetched cache block is inserted, based on the estimated prefetcher-generated cache pollution. At the end of a sampling interval, the estimated cache pollution metric is compared to two thresholds (P_low and P_high) to determine whether the pollution caused by the prefetcher was low, medium, or high. If the pollution caused by the prefetcher was low, the prefetched cache blocks are inserted into the middle (MID) position of the LRU stack during the next sampling interval (for an n-way set-associative cache, we define the MID position as the floor(n/2)-th least-recently-used position).^9 If the pollution caused by the prefetcher was medium, prefetched cache blocks are inserted into the LRU-4 position of the LRU stack (for an n-way set-associative cache, we define the LRU-4 position as the floor(n/4)-th least-recently-used position). Finally, if the pollution caused by the prefetcher was high, prefetched cache blocks are inserted into the LRU position during the next sampling interval. A compact code sketch of this heuristic follows the footnotes below.

^8 Note that we adjust prefetcher behavior on a global (across-streams) basis rather than on a per-stream basis, as we did not find much benefit in adjusting on a per-stream basis.
^9 We found that inserting prefetched blocks into the MRU position does not provide significant benefits over inserting them into the MID position. Thus, our dynamic mechanism does not insert prefetched blocks into the MRU position. For a detailed analysis of the cache insertion policy, see Section 5.2.
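Stated as code, the insertion heuristic reduces to two threshold comparisons. The sketch below assumes the baseline 16-way cache and an LRU-stack encoding in which position 0 is the LRU end; the threshold values are those given later in Section 4.3, and the function name is ours.

# Sketch of the dynamic insertion heuristic (Section 3.3.2). In this encoding,
# position 0 is LRU and position n-1 is MRU (an assumption of the sketch).
P_LOW, P_HIGH = 0.005, 0.25   # pollution thresholds from Section 4.3

def insertion_position(pollution, n_ways=16):
    if pollution < P_LOW:
        return n_ways // 2    # MID: floor(n/2)-th least-recently-used position
    elif pollution < P_HIGH:
        return n_ways // 4    # LRU-4: floor(n/4)-th least-recently-used position
    else:
        return 0              # LRU: evicted first if never touched by a demand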
Case | Prefetch Accuracy | Prefetch Lateness | Cache Pollution | Dynamic Configuration Counter Update (reason)
1    | High   | Late     | Not-Polluting | Increment (to increase timeliness)
2    | High   | Late     | Polluting     | Increment (to increase timeliness)
3    | High   | Not-Late | Not-Polluting | No Change (best case configuration)
4    | High   | Not-Late | Polluting     | Decrement (to reduce pollution)
5    | Medium | Late     | Not-Polluting | Increment (to increase timeliness)
6    | Medium | Late     | Polluting     | Decrement (to reduce pollution)
7    | Medium | Not-Late | Not-Polluting | No Change (to keep the benefits of timely prefetches)
8    | Medium | Not-Late | Polluting     | Decrement (to reduce pollution)
9    | Low    | Late     | Not-Polluting | Decrement (to save bandwidth)
10   | Low    | Late     | Polluting     | Decrement (to reduce pollution)
11   | Low    | Not-Late | Not-Polluting | No Change (to keep the benefits of timely prefetches)
12   | Low    | Not-Late | Polluting     | Decrement (to reduce pollution and save bandwidth)
Table 2. How to adapt? Use of the three metrics to adjust the aggressiveness of the prefetcher
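Table 2 amounts to a small decision function over the three classified metrics. The following sketch encodes it directly, using the classification thresholds of Section 4.3 and clamping the counter to the 1..5 range of Table 1; the identifier names are ours.

# Direct encoding of Table 2 (illustrative; names are ours). `counter` is the
# Dynamic Configuration Counter of Table 1, saturating at 1 and 5.
A_HIGH, A_LOW = 0.75, 0.40    # accuracy thresholds (Section 4.3)
T_LATENESS = 0.01
T_POLLUTION = 0.005

def update_aggressiveness(counter, accuracy, lateness, pollution):
    late = lateness > T_LATENESS
    polluting = pollution > T_POLLUTION
    if accuracy >= A_HIGH:                                        # Cases 1-4
        delta = 1 if late else (-1 if polluting else 0)
    elif accuracy >= A_LOW:                                       # Cases 5-8
        delta = 1 if (late and not polluting) else (-1 if polluting else 0)
    else:                                                         # Cases 9-12
        delta = -1 if (late or polluting) else 0
    return min(5, max(1, counter + delta))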

4. Evaluation Methodology

We evaluate the performance impact of FDP on an in-house execution-driven Alpha ISA simulator that models an aggressive superscalar, out-of-order execution processor. The parameters of the processor we model are shown in Table 3.

4.1. Memory Model

We evaluate our mechanisms using a detailed memory model which faithfully mimics the behavior and the bandwidth/port limitations of all the hardware structures in the memory system. All the mentioned effects are modeled correctly and bandwidth limitations are enforced in our model as described in [16]. The memory bus has a bandwidth of 4.5 GB/s.

The baseline hardware data prefetcher we model is a stream prefetcher that can track 64 different streams. Prefetch requests generated by the stream prefetcher are inserted into the Prefetch Request Queue, which has 128 entries in our model. Requests are drained from this queue and inserted into the L2 Request Queue, where they are given the lowest priority so that they do not delay demand load/store requests. Requests that miss in the L2 cache access DRAM memory by going through the Bus Request Queue. The L2 Request Queue, Bus Request Queue, and L2 Fill Queue have 128 entries each. Only when a prefetch request goes out on the bus does it count towards the number of prefetches sent to memory. In the baseline, a prefetched cache block is placed into the MRU position in the L2 cache.

4.2. Benchmarks

We focus our evaluation on those benchmarks from the SPEC CPU2000 suite for which the most aggressive prefetcher configuration sends out to memory at least 200K prefetch requests over the 250 million instruction run. On the remaining nine programs of the SPEC CPU2000 suite, the potential for improving either the performance or the bandwidth-efficiency of the prefetcher is limited because the prefetcher is not active (even if it is configured very aggressively).^10 For reference, the number of prefetches generated for each benchmark in the SPEC CPU2000 suite is shown in Table 4. The benchmarks were compiled using the Compaq C/Fortran compilers with the -fast optimizations and profile-driven feedback enabled. All benchmarks are fast-forwarded to skip the initialization portion and then simulated for 250 million instructions.

^10 We also evaluated the remaining benchmarks, which have less potential. Results for these benchmarks are shown in Section 5.11.

4.3. Thresholds Used in FDP Implementation

The thresholds used in the implementation of our mechanism are provided below. We determined the parameters of our mechanism empirically using a limited number of simulation runs. However, we did not tune the parameters to our application set, since this would require a number of simulations exponential in the number of different parameter combinations. We estimate that optimizing these thresholds can further improve the performance and bandwidth-efficiency of our mechanism.

A_high | A_low | T_lateness | T_pollution | P_high | P_low
0.75   | 0.40  | 0.01       | 0.005       | 0.25   | 0.005

In systems where bandwidth contention is estimated to be higher (e.g. systems where many threads share the memory bandwidth), the A_high and A_low thresholds can be increased to restrict the prefetcher from being too aggressive. In systems where the lateness of prefetches is estimated to be higher due to higher contention in the memory system, reducing the T_lateness threshold can increase performance by increasing the timeliness of the prefetcher. Reducing the T_pollution, P_high, or P_low thresholds reduces the prefetcher-generated cache pollution. In systems with higher contention for the L2 cache space (e.g. systems with a smaller L2 cache or with many threads sharing the same L2 cache), reducing the values of T_pollution, P_high, or P_low may be desirable to reduce the cache pollution due to prefetching.
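For reference, the six static thresholds above can be collected into a single configuration annotated with the tuning directions just described. This is an illustrative summary; the field names are ours.

# The six static FDP thresholds of Section 4.3, gathered into one place.
FDP_THRESHOLDS = {
    "A_high":      0.75,   # raise A_* when memory bandwidth is scarce
    "A_low":       0.40,
    "T_lateness":  0.01,   # lower when prefetches tend to arrive later
    "T_pollution": 0.005,  # lower T_pollution/P_* when L2 capacity is contended
    "P_high":      0.25,
    "P_low":       0.005,
}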
5. Experimental Results and Analyses

5.1. Adjusting Prefetcher Aggressiveness

We first evaluate the performance of using FDP to adjust the aggressiveness of the stream prefetcher (as described in Section 3.3.1) in comparison to four traditional configurations that do not incorporate dynamic feedback: No prefetching, Very Conservative prefetching, Middle-of-the-Road prefetching, and Very Aggressive prefetching. Figure 5 shows the IPC performance of each configuration. Adjusting the prefetcher aggressiveness dynamically (i.e. Dynamic Aggressiveness) provides the best average performance across all configurations. Dynamically adapting the aggressiveness of the prefetcher using the proposed feedback mechanism provides 4.7% higher average IPC than the Very Aggressive configuration and 11.9% higher IPC than the Middle-of-the-Road configuration.

On almost all benchmarks, Dynamic Aggressiveness provides performance that is very close to that achieved by the best-performing traditional prefetcher configuration for each benchmark.
Pipeline           | 20-cycle minimum branch misprediction penalty; 4 GHz processor
Branch Predictor   | aggressive hybrid branch predictor (64K-entry gshare, 64K-entry per-address w/ 64K-entry selector); wrong-path execution faithfully modeled
Instruction Window | 128-entry reorder buffer; 128-entry INT, 128-entry FP physical register files; 64-entry store buffer
Execution Core     | 8-wide, fully-pipelined except for FP divide; full bypass network
On-chip Caches     | 64KB instruction cache with 2-cycle latency; 64KB, 4-way L1 data cache with 8 banks and 2-cycle latency, allows 4 load accesses per cycle; 1MB, 16-way, unified L2 cache with 8 banks and 10-cycle latency, 128 L2 MSHRs, 1 L2 read port, 1 L2 write port; all caches use LRU replacement and have 64B block size
Buses and Memory   | 500-cycle minimum main memory latency; 32 DRAM banks; 32B-wide, split-transaction core-to-memory bus at 4:1 frequency ratio; 4.5 GB/s bus bandwidth; max. 128 outstanding misses to main memory; bank conflicts, bandwidth, port contention, and queueing delays faithfully modeled
Table 3. Baseline processor configuration

bzip2 crafty eon gap gcc gzip mcf parser perlbmk twolf vortex vpr
336K 59K 4969 1656K 110K 31K 2585K 515K 9218 2749 591K 246K
ammp applu apsi art equake facerec fma3d galgel lucas mesa mgrid sixtrack swim wupwise
1157K 6038K 8656 13319K 2414K 2437K 3643 243K 1103 273K 2185K 292K 8766K 799K
Table 4. Number of prefetches sent by a very aggressive stream prefetcher for each benchmark in the SPEC CPU2000 suite

Hence, the dynamic mechanism is able to detect and employ the best-performing aggressiveness level for the stream prefetcher on a per-benchmark basis.

Figure 5 shows that Dynamic Aggressiveness almost completely eliminates the large performance degradation incurred on some benchmarks due to Very Aggressive prefetching. While the most aggressive traditional prefetcher configuration provides the best average performance, it results in a 28.9% performance loss on applu and a 48.2% performance loss on ammp compared to no prefetching. In contrast, Dynamic Aggressiveness results in a 1.8% performance improvement on applu and only a 5.9% performance loss on ammp compared to no prefetching, similar to the best-performing traditional prefetcher configuration for these two benchmarks.

Figure 5. Dynamic adjustment of prefetcher aggressiveness

5.1.1. Adapting to the Program: Figure 6 shows the distribution of the value of the Dynamic Configuration Counter over all sampling intervals in the Dynamic Aggressiveness mechanism. For benchmarks where aggressive prefetching hurts performance (e.g. applu, galgel, ammp), the feedback mechanism chooses and employs the least aggressive dynamic configuration (counter value of 1) for most of the sampling intervals. For example, the prefetcher is configured to be Very Conservative in more than 98% of the intervals for both applu and ammp. On the other hand, for benchmarks where aggressive prefetching significantly increases performance (e.g. wupwise, mgrid, equake), FDP employs the most aggressive configuration for most of the sampling intervals. For example, the prefetcher is configured to be Very Aggressive in more than 98% of the intervals for wupwise, mgrid, and equake.

Figure 6. Distribution of the dynamic aggressiveness level (percentage of sampling intervals spent at each level, from Very Conservative (1) to Very Aggressive (5))

5.2. Adjusting Cache Insertion Policy of Prefetches

Figure 7 shows the performance of dynamically adjusting the cache insertion policy (i.e. Dynamic Insertion) using FDP as described in Section 3.3.2. The performance of Dynamic Insertion is compared to four different static insertion policies that always insert a prefetched block into (1) the MRU position, (2) the MID (floor(n/2)-th) position, where n is the set-associativity, (3) the LRU-4 (floor(n/4)-th least-recently-used) position, or (4) the LRU position of the LRU stack. The dynamic cache insertion policy is evaluated using the Very Aggressive prefetcher configuration.

Figure 7. Dynamic adjustment of prefetch insertion policy
The data in Figure 7 shows that statically inserting prefetches in the LRU position can result in significant average performance loss compared to statically inserting prefetches in the MRU position. This is because inserting prefetched blocks in the LRU position causes an aggressive prefetcher to evict prefetched blocks before they are used by demand loads/stores. However, inserting in the LRU position eliminates the performance loss due to aggressive prefetching in benchmarks where aggressive prefetching hurts performance (e.g. applu and ammp). Among the static cache insertion policies, inserting the prefetched blocks into the LRU-4 position provides the best average performance, improving performance by 3.2% over inserting prefetched blocks in the MRU position.

Adjusting the cache insertion policy dynamically provides higher performance than any of the static insertion policies. Dynamic Insertion achieves 5.1% better performance than inserting prefetched blocks into the MRU position and 1.9% better performance than inserting them into the LRU-4 position. Furthermore, Dynamic Insertion almost always provides the performance of the best static insertion policy for each benchmark. Hence, dynamically adapting the prefetch insertion policy using run-time estimates of prefetcher-generated cache pollution is able to detect and employ the best-performing cache insertion policy for the stream prefetcher on a per-benchmark basis.

Figure 8 shows the distribution of the insertion position of the prefetched blocks when Dynamic Insertion is used. For benchmarks where a static policy of inserting prefetched blocks into the LRU position provides the best performance across all static configurations (applu, galgel, ammp), Dynamic Insertion places most (more than 50%) of the prefetched blocks into the LRU position. Therefore, Dynamic Insertion improves the performance of these benchmarks by dynamically employing the best-performing insertion policy.

Figure 8. Distribution of the insertion position of prefetched blocks (percentage of prefetch insertions into the LRU, LRU-4, and MID positions)

5.3. Putting It All Together: Dynamically Adjusting Both Aggressiveness and Insertion Policy

This section examines the use of FDP for dynamically adjusting both the prefetcher aggressiveness (Dynamic Aggressiveness) and the cache insertion policy of prefetched blocks (Dynamic Insertion). Figure 9 compares the performance of five different mechanisms, from left to right: (1) No prefetching, (2) Very Aggressive prefetching, (3) Very Aggressive prefetching with Dynamic Insertion, (4) Dynamic Aggressiveness, and (5) Dynamic Aggressiveness and Dynamic Insertion together.

Using Dynamic Aggressiveness and Dynamic Insertion together provides the best performance across all configurations, improving the IPC by 6.5% over the best-performing traditional prefetcher configuration (i.e. the Very Aggressive configuration). This performance improvement is greater than that provided by Dynamic Aggressiveness or Dynamic Insertion alone. Hence, dynamically adjusting both aspects of prefetcher behavior (aggressiveness and insertion policy) provides complementary performance benefits.

Figure 9. Overall performance of FDP

With the use of FDP to dynamically adjust both aspects of prefetcher behavior, the performance loss incurred on some benchmarks due to aggressive prefetching is completely eliminated. No benchmark loses performance compared to no prefetching if both Dynamic Aggressiveness and Dynamic Insertion are used. In fact, FDP improves the performance of applu by 13.4% and of ammp by 11.4% over no prefetching; these two benchmarks otherwise incur very significant performance losses with an aggressive traditional prefetcher configuration.

5.4. Impact of FDP on Bandwidth Consumption

Aggressive prefetching can adversely affect the bandwidth consumption in the memory system when prefetches are not used or when they cause cache pollution. Figure 10 shows the bandwidth impact of prefetching in terms of Memory Bus Accesses per thousand retired Instructions (BPKI).^11 Increasing the aggressiveness of the traditional stream prefetcher significantly increases the memory bandwidth consumption, especially for benchmarks where the prefetcher degrades performance. FDP reduces the aggressiveness of the prefetcher in these benchmarks. For example, in applu and ammp our feedback mechanism usually chooses the least aggressive prefetcher configuration and the least aggressive cache insertion policy, as shown in Figures 6 and 8. This results in the large reduction in BPKI shown in Figure 10. FDP (Dynamic Aggressiveness and Dynamic Insertion) consumes 18.7% less memory bandwidth than the Very Aggressive traditional prefetcher configuration, while providing 6.5% higher performance.

Table 5 shows the average performance and average bandwidth consumption of different traditional prefetcher configurations and FDP. Compared to the traditional prefetcher configuration that consumes a similar amount of memory bandwidth as FDP,^12 FDP provides 13.6% higher performance. Hence, incorporating our dynamic feedback mechanism into the stream prefetcher significantly increases the bandwidth-efficiency of the baseline stream prefetcher.

^11 We use bus accesses (rather than the number of prefetches sent) as our bandwidth metric because this metric includes the effect of L2 misses caused by demand accesses as well as by prefetches. If the prefetcher is polluting the cache, then the number of L2 misses due to demand accesses also increases. Hence, counting the number of bus accesses provides a more accurate measure of the memory bandwidth consumed by the prefetcher.
^12 The Middle-of-the-Road configuration consumes only 2.5% less memory bandwidth than FDP.
Figure 10. Effect of FDP on memory bandwidth consumption (BPKI)

     | No pref. | Very Cons. | Middle | Very Aggr. | FDP
IPC  | 0.85     | 1.21       | 1.47   | 1.57       | 1.67
BPKI | 8.56     | 9.34       | 10.60  | 13.38      | 10.88
Table 5. Average IPC and BPKI for FDP vs. conventional prefetchers

5.5. Hardware Cost and Complexity of FDP

Table 6 shows the hardware cost of the proposed mechanism in terms of the required state. FDP does not add significant combinational logic complexity to the processor. Combinational logic is required for the update of the counters, the update of the pref-bits in the L2 cache, the update of the entries in the pollution filter, the calculation of the feedback metrics at the end of each sampling interval, the determination of when a sampling interval ends, and the insertion of prefetched blocks into the appropriate locations in the LRU stack of an L2 cache set. None of the required logic is on the critical path of the processor. The storage overhead of our mechanism is less than 0.25% of the data-store size of the baseline 1MB L2 cache.

5.6. Using only Prefetch Accuracy for Feedback

We use a comprehensive set of metrics (prefetch accuracy, timeliness, and pollution) to provide feedback for adjusting the prefetcher aggressiveness. In order to assess the benefit of using timeliness and cache pollution in addition to accuracy, we evaluated a mechanism that adapts the prefetcher aggressiveness based only on accuracy. In such a scheme, we increment the Dynamic Configuration Counter if the accuracy is high and decrement it if the accuracy is low. We found that, compared to this scheme that only uses accuracy to throttle the aggressiveness of a stream prefetcher, our comprehensive mechanism that also takes into account timeliness and cache pollution provides 3.4% higher performance and consumes 2.5% less bandwidth.

5.7. FDP vs. Using a Prefetch Cache

Cache pollution caused by prefetches can be eliminated by bringing prefetched data into separate prefetch buffers [13, 11] rather than inserting prefetched data into the L2 cache. Figures 11 and 12 respectively show the performance and bandwidth consumption of the Very Aggressive prefetcher with different prefetch cache sizes, ranging from a 2KB fully-associative prefetch cache to a 1MB 16-way prefetch cache.^13 The performance of the Very Aggressive prefetcher and of FDP when prefetched data is inserted into the L2 cache is also shown.

Figure 11. Performance of prefetch cache vs. FDP

Figure 12. Bandwidth consumption of prefetch cache vs. FDP

The results show that using a small (2KB or 8KB) prefetch cache does not provide as high performance as inserting the prefetched data into the L2 cache. With an aggressive prefetcher and a small prefetch cache, the prefetched blocks are displaced by later prefetches before being used by the program, which results in performance degradation with a small prefetch cache. However, larger prefetch caches (32KB and larger) improve performance compared to inserting prefetched data into the L2 cache, because a larger prefetch cache reduces the pollution caused by prefetched data in the L2 cache while providing enough space for prefetched blocks.

Using FDP (both Dynamic Aggressiveness and Dynamic Insertion) with prefetching into the L2 cache provides 5.3% higher performance than augmenting the Very Aggressive traditional prefetcher configuration with a 32KB prefetch cache. The performance of FDP is also within 2% of the performance of the Very Aggressive configuration with a 64KB prefetch cache. Furthermore, the memory bandwidth consumption of FDP is 16% and 9% less than that of the Very Aggressive prefetcher configurations with a 32KB and a 64KB prefetch cache, respectively. Hence, FDP bandwidth-efficiently achieves the performance provided by a relatively large prefetch cache, without requiring the hardware cost and complexity introduced by the addition of a prefetch cache larger than 32KB.

5.8. Effect on a Global History Buffer Prefetcher

We have also implemented FDP on the C/DC (C-Zone Delta Correlation) variant of the Global History Buffer (GHB) prefetcher [10].

^13 In the configurations with a prefetch cache, a prefetched cache block is moved from the prefetch cache into the L2 cache if it is accessed by a demand load/store request. The block size of the prefetch cache and the L2 cache are the same, and the prefetch cache is assumed to be accessed in parallel with the L2 cache without any adverse impact on L2 cache access time.
pref-bit for each tag-store entry in the L2 cache            | 16384 blocks * 1 bit/block = 16384 bits
Pollution Filter                                             | 4096 entries * 1 bit/entry = 4096 bits
16-bit counters used to estimate feedback metrics            | 11 counters * 16 bits/counter = 176 bits
pref-bit for each MSHR entry                                 | 128 entries * 1 bit/entry = 128 bits
Total hardware cost                                          | 20784 bits = 2.54 KB
Percentage area overhead compared to baseline 1MB L2 cache   | 2.5KB/1024KB = 0.24%
Table 6. Hardware cost of feedback directed prefetching

5.8. Effect on a Global History Buffer Prefetcher

We have also implemented FDP on the C/DC (C-Zone Delta Correlation) variant of the Global History Buffer (GHB) prefetcher [10]. In order to vary the aggressiveness of this prefetcher dynamically, we vary the Prefetch Degree.14 The aggressiveness configurations used for the GHB prefetcher are shown below (a code sketch of this mapping follows the table); FDP adjusts the configuration of the GHB prefetcher as described in Section 3.3.

Dyn. Config. Counter   Aggressiveness        Prefetch Degree
1                      Very Conservative      4
2                      Conservative           8
3                      Middle-of-the-Road    16
4                      Aggressive            32
5                      Very Aggressive       64

14 In the GHB-based prefetching mechanism, Prefetch Distance and Prefetch Degree are the same.
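For illustration, the mapping from the Dynamic Configuration Counter to the GHB prefetch degree can be written as a small helper; the function name is hypothetical.

```cpp
// Hypothetical helper mapping the Dynamic Configuration Counter
// (assumed range 1..5) to the GHB prefetch degree from the table
// above. For this prefetcher, Prefetch Distance equals Prefetch
// Degree (footnote 14), so one value controls both.
int ghb_prefetch_degree(int dyn_config_counter) {
    static const int kDegree[5] = {4, 8, 16, 32, 64};
    return kDegree[dyn_config_counter - 1];  // counter 1 -> 4, ..., 5 -> 64
}
```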
Figure 13 shows the performance and bandwidth consumption of different GHB prefetcher configurations and the feedback directed GHB prefetcher using both Dynamic Aggressiveness and Dynamic Insertion. The feedback directed GHB prefetcher performs similarly to the best-performing traditional configuration (the Very Aggressive configuration), while it consumes 20.8% less memory bandwidth. Compared to the traditional GHB prefetcher configuration that consumes a similar amount of memory bandwidth as FDP (i.e. the Middle-of-the-Road configuration), FDP provides 9.9% higher performance. Hence, FDP significantly increases the bandwidth-efficiency of GHB-based delta correlation prefetching. Note that it is possible to improve the performance and bandwidth benefits of the proposed mechanism by tuning the thresholds used in the feedback mechanisms to the behavior of the GHB-based prefetcher, but we did not pursue this option.

5.9. Effect of FDP on a PC-Based Stride Prefetcher

We also evaluated FDP on a PC-based stride prefetcher [1] and found that the results are similar to those achieved on both stream and GHB-based prefetchers. On average, using the feedback directed approach results in a 4% performance gain and a 24% reduction in memory bandwidth compared to the best-performing conventional configuration for a PC-based stride prefetcher. Due to space constraints, we do not present detailed graphs for these results.

5.10. Sensitivity to L2 Size and Memory Latency

We evaluate the sensitivity of FDP to different cache sizes and memory latencies. In these experiments, we varied the L2 cache size keeping the memory latency at 500 cycles (baseline) and varied the memory latency keeping the cache size at 1MB (baseline). Table 7 shows the change in average IPC and BPKI provided by FDP over the best performing conventional prefetcher configuration. FDP provides better performance and consumes significantly less bandwidth than the best-performing conventional prefetcher configuration for all evaluated cache sizes and memory latencies. As memory latency increases, the IPC improvement of FDP also increases because the effectiveness of the prefetcher becomes more important when memory becomes a larger performance bottleneck.

5.11. Effect on Other SPEC CPU2000 Benchmarks

Figure 14 shows the IPC and BPKI impact of FDP on the remaining 9 SPEC CPU2000 benchmarks, which have less potential for prefetching. We find that our feedback directed scheme provides 0.4% performance improvement over the best performing conventional prefetcher configuration (i.e. the Middle-of-the-Road configuration) while reducing the bandwidth consumption by 0.2%. None of the benchmarks lose performance with FDP. Note that the best-performing conventional configuration for these 9 benchmarks is not the same as the best-performing conventional configuration for the 17 memory-intensive benchmarks (i.e. the Very Aggressive configuration). Also note that the remaining 9 benchmarks are not bandwidth-intensive except for fma3d and gcc. In gcc, the performance improvement of FDP is 3.0% over the Middle-of-the-Road configuration. The prefetcher pollutes the L2 cache and evicts many useful instruction blocks in gcc, resulting in very long-latency instruction cache misses that leave the processor idle. Using FDP reduces this negative effect by detecting the pollution caused by prefetch references and dynamically reducing the aggressiveness of the prefetcher.

6. Related Work

Even though mechanisms for prefetching have been studied for a long time, dynamic mechanisms to adapt the aggressiveness of the prefetcher have not been studied as extensively as algorithms that decide what to prefetch. We briefly describe previous work in dynamic adaptation of prefetching policies.

6.1. Dynamic Adaptation of Data Prefetching Policies

The work most related to ours in adapting the prefetcher's aggressiveness is Dahlgren et al.'s paper that proposed adaptive sequential (next-line) prefetching [4] for multiprocessors. This mechanism implemented two counters to count the number of sent prefetches (counter-sent) and the number of useful prefetches (counter-used). When counter-sent saturates, counter-used is compared to a static threshold to decide whether to increase or decrease the aggressiveness (i.e. the Prefetch Distance) of the prefetcher. While Dahlgren et al.'s mechanism to calculate prefetcher accuracy is conceptually similar to ours, their approach considered only prefetch accuracy to dynamically adapt the prefetch distance. Also, their mechanism is designed for a simple sequential prefetching mechanism which prefetches up to 8 cache blocks following each cache miss. In this paper, we provide a generalized feedback-directed approach for dynamically adjusting the aggressiveness of a wide range of state-of-the-art hardware data prefetchers by taking into account not only accuracy but also timeliness and pollution.
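As a rough sketch, Dahlgren et al.'s two-counter scheme might be organized as follows. The saturation window and the usefulness threshold here are assumptions for illustration; [4] defines its own static threshold.

```cpp
#include <cstdint>

// Illustrative sketch of adaptive sequential prefetching in the
// style of Dahlgren et al. [4]: count sent and useful prefetches,
// and adjust the prefetch distance when the sent-counter saturates.
struct SequentialThrottle {
    std::uint32_t sent = 0, used = 0;
    int distance = 1;                       // next-line prefetch distance

    void on_prefetch_sent() { ++sent; }
    void on_prefetch_used() { ++used; }

    void maybe_adapt() {
        const std::uint32_t kWindow = 16;   // assumed saturation point
        if (sent < kWindow) return;
        if (used * 2 >= sent)               // assumed threshold: >=50% useful
            ++distance;                     // more aggressive
        else if (distance > 1)
            --distance;                     // less aggressive
        sent = used = 0;                    // start a new measurement window
    }
};
```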

[Figure 13: grouped bar charts over the evaluated SPEC CPU2000 benchmarks; the left panel plots Instructions per Cycle and the right panel plots BPKI for the No prefetching, Very Conservative, Middle-of-the-Road, Very Aggressive, and Dynamic Aggressiveness + Dynamic Insertion configurations.]
Figure 13. Effect of FDP on the IPC performance (left) and BPKI memory bandwidth consumption (right) of GHB-based C/DC prefetchers
L2 Cache Size (memory latency = 500 cycles):
    512 KB:  ∆IPC = 0%,    ∆BPKI = -13.9%
    1 MB:    ∆IPC = 6.5%,  ∆BPKI = -18.7%
    2 MB:    ∆IPC = 6.3%,  ∆BPKI = -29.6%
Memory Latency (L2 cache size = 1 MB):
    250 cycles:   ∆IPC = 4.5%,  ∆BPKI = -23.0%
    500 cycles:   ∆IPC = 6.5%,  ∆BPKI = -18.7%
    1000 cycles:  ∆IPC = 8.4%,  ∆BPKI = -16.9%
Table 7. Change in IPC and BPKI with FDP when L2 size and memory latency are varied
[Figure 14: grouped bar charts over the remaining 9 SPEC CPU2000 benchmarks; the left panel plots Instructions per Cycle and the right panel plots BPKI for the No prefetching, Very Conservative, Middle-of-the-Road, Very Aggressive, and Dynamic Aggressiveness + Dynamic Insertion configurations.]
Figure 14. IPC performance (left) and memory bandwidth consumption in BPKI (right) impact of FDP on the remaining SPEC benchmarks

When the program enters a new phase of execution, the prefetcher is tuned based on the characteristics of the phase in Nesbit et al. [10]. In order to perform phase detection/prediction and identification of the best prefetcher configuration for a given phase, a significant amount of extra hardware is needed. In comparison, our mechanism is simpler because it does not require phase detection or prediction mechanisms.

Recently, Hur and Lin [7] proposed a probabilistic technique that adjusts the aggressiveness of a stream prefetcher based on the estimated spatial locality of the program. Their approach is applicable only to stream prefetchers, as it tries to estimate a histogram of the stream length.
6.2. Cache Pollution Filtering

Charney and Puzak [3] proposed filtering L1 cache pollution caused by next-sequential prefetching and shadow directory prefetching from the L2 cache into the L1 cache. Their scheme associates a confirmation bit with each block in the L2 cache, which indicates whether the block was used by a demand access the last time it was prefetched into the L1 cache. If the confirmation bit is not set when a prefetch request accesses the L2, the prefetch request is discarded. Extending this scheme to prefetching from main memory to the L2 cache requires a separate structure that maintains information about the blocks evicted from the L2 cache, which significantly increases the hardware cost of their mechanism. Our mechanism does not need to keep history information for evicted L2 cache blocks.
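A minimal sketch of such a confirmation-bit filter follows; the structure and function names are illustrative stand-ins, not the design from [3].

```cpp
// Sketch of a confirmation-bit pollution filter in the style of
// Charney and Puzak [3]: each L2 block remembers whether it was
// demand-used the last time it was prefetched into the L1, and an
// unconfirmed block is not prefetched again.
struct L2Entry {
    bool valid = false;
    bool confirmation = true;   // demand-used on its last prefetch?
};

// Called when the L1 prefetcher probes the L2 for a candidate block.
bool allow_prefetch_into_l1(L2Entry& e) {
    if (e.valid && !e.confirmation)
        return false;           // last prefetch was never used: discard
    e.confirmation = false;     // set again only by a demand access
    return true;
}

// Called when a demand load/store uses the block in the L1.
void on_demand_use(L2Entry& e) { e.confirmation = true; }
```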
Zhuang and Lee [25] proposed to filter prefetcher-generated cache pollution by using schemes similar to two-level branch predictors. Their mechanism tries to identify whether or not a prefetch will be useful based on past information about the usefulness of the prefetches generated to the same memory address or triggered by the same load instruction. In contrast, our mechanism does not require the collection of fine-grain information on each prefetch address or load address in order to vary the aggressiveness of the prefetcher.

Other approaches for cache pollution filtering include using a profiling mechanism to mark load instructions that can trigger hardware prefetches [23], and using compile-time techniques to mark dead cache locations so that prefetches can be inserted in dead locations [9]. In comparison to these two mechanisms, our mechanism does not require any software or ISA support and can adjust to dynamic program behavior even if it differs from the behavior of the compile-time profile. Lin et al. [15] proposed using density vectors to determine what to prefetch inside a region. This was especially useful in their model as they used very bandwidth-intensive scheduled region prefetching, which prefetches all the cache blocks in a memory region on a cache miss. This approach can be modified and combined with our proposal to further remove the pollution caused by blocks that are not used in a prefetch stream.

Mutlu et al. [17] used the L1 caches as filters to reduce L2 cache pollution caused by useless prefetches. In their scheme, all prefetched blocks are placed into only the L1 cache. A prefetched block is placed into the L2 when it is evicted from the L1 cache only if it was needed by a demand request while it was in L1. In addition to useless prefetches, this approach also filters out some useful but early prefetches that are not used while residing in the L1 cache (such prefetches are common in very aggressive prefetchers). To obtain performance benefit from such prefetches, their scheme can be combined with our cache insertion policy.
6.3. Cache Insertion Policy for Prefetches

Lin et al. [14] evaluated static policies to determine the placement in the cache of prefetches generated by a scheduled region prefetcher. Their scheme placed prefetches in the LRU position of the LRU stack. We found that, even though inserting prefetches in the LRU position reduces the cache pollution effects of prefetches on some benchmarks, it also reduces the positive benefits of aggressive stream prefetching on other benchmarks, because useful prefetches, if placed in the LRU position, can be easily evicted from the cache in an aggressive prefetching scheme without providing any benefit. Dynamically adjusting the insertion policy of prefetched blocks based on the estimated pollution increases performance by 1.9% over the best static policy (LRU-4) and by 18.8% over inserting prefetches in the LRU position.
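In the spirit of this dynamic insertion policy, the choice of insertion position might look as follows; the three-level pollution classification and the exact stack positions are illustrative assumptions, not the paper's tuned policy.

```cpp
// Illustrative sketch: pick the LRU-stack insertion position for a
// prefetched block from the estimated pollution level. Position 0 is
// MRU; position (associativity - 1) is LRU. The classification into
// three levels and the chosen positions are assumptions.
enum class Pollution { Low, Medium, High };

int prefetch_insertion_position(Pollution p, int associativity /* e.g. 16 */) {
    switch (p) {
        case Pollution::Low:    return 0;                  // MRU position
        case Pollution::Medium: return associativity - 4;  // LRU-4 style
        case Pollution::High:   return associativity - 1;  // LRU position
    }
    return 0;
}
```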
7. Conclusion and Future Work

This paper proposed a feedback directed mechanism that dynamically adjusts the behavior of a hardware data prefetcher to improve performance and reduce memory bandwidth consumption. Over previous research in adaptive prefetching, our contributions are:

• We propose a comprehensive and low-cost feedback mechanism that takes into account prefetch accuracy, timeliness, and cache pollution caused by prefetch requests together, to both throttle the aggressiveness of the prefetcher and to decide where in the cache to place the prefetched blocks. Previous approaches considered using only prefetch accuracy to determine the aggressiveness of simple sequential (next-line) prefetchers.

• We develop a low-cost mechanism to estimate at run-time the cache pollution caused by hardware prefetching.

• We propose and evaluate using comprehensive feedback mechanisms for state-of-the-art stream prefetchers that are commonly employed by today's high-performance processors. Our feedback-directed mechanism is applicable to any kind of hardware data prefetcher. We show that it works well with stream-based prefetchers, global-history-buffer based prefetchers, and PC-based stride prefetchers. Previous adaptive mechanisms were applicable only to simple sequential prefetchers [4].

Future work can incorporate other important metrics, such as available memory bandwidth, estimates of the contention in the memory system, and prefetch coverage, into the dynamic feedback mechanism to provide further improvement in performance and further reduction in memory bandwidth consumption. The metrics defined and used in this paper could also be used as part of the selection mechanism in a hybrid prefetcher. Finally, the mechanisms proposed in this paper can be easily extended to instruction prefetchers.

Acknowledgments

We thank Matthew Merten, Moinuddin Qureshi, members of the HPS Research Group, and the anonymous reviewers for their comments and suggestions. We gratefully acknowledge the support of the Cockrell Foundation, Intel Corporation, and the Advanced Technology Program of the Texas Higher Education Coordinating Board.

References

[1] J. Baer and T. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, 1991.
[2] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[3] M. Charney and T. Puzak. Prefetching and memory system behavior of the SPEC95 benchmark suite. IBM Journal of Research and Development, 41(3):265–286, 1997.
[4] F. Dahlgren, M. Dubois, and P. Stenström. Sequential hardware prefetching in shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 6(7):733–746, 1995.
[5] J. D. Gindele. Buffer block prefetching method. IBM Technical Disclosure Bulletin, 20(2):696–697, July 1977.
[6] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Feb. 2001. Q1 2001 issue.
[7] I. Hur and C. Lin. Memory prefetching using adaptive stream detection. In MICRO-39, 2006.
[8] S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou, and S. G. Abraham. Effective stream-based and execution-based data prefetching. In ICS, 2004.
[9] P. Jain, S. Devadas, and L. Rudolph. Controlling cache pollution in prefetching with software-assisted cache replacement. Technical Report CSG-462, Massachusetts Institute of Technology, 2001.
[10] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. AC/DC: An adaptive data cache prefetcher. In PACT, 2004.
[11] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In ISCA-17, 1990.
[12] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA-8, 1981.
[13] R. L. Lee, P.-C. Yew, and D. H. Lawrie. Data prefetching in shared memory multiprocessors. In ICPP, 1987.
[14] W.-F. Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In HPCA-7, 2001.
[15] W.-F. Lin, S. K. Reinhardt, D. Burger, and T. R. Puzak. Filtering superfluous prefetches using density vectors. In ICCD, 2001.
[16] O. Mutlu, H. Kim, D. N. Armstrong, and Y. N. Patt. An analysis of the performance impact of wrong-path memory references on out-of-order and runahead execution processors. IEEE Transactions on Computers, 54(12):1556–1571, Dec. 2005.
[17] O. Mutlu, H. Kim, D. N. Armstrong, and Y. N. Patt. Using the first-level caches as filters to reduce the pollution caused by speculative memory references. International Journal of Parallel Programming, 33(5):529–559, Oct. 2005.
[18] O. Mutlu, H. Kim, and Y. N. Patt. Techniques for efficient processing in runahead execution engines. In ISCA-32, 2005.
[19] S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement. In ISCA-21, 1994.
[20] J.-K. Peir, S.-C. Lai, S.-L. Lu, J. Stark, and K. Lai. Bloom filtering cache misses for accurate data speculation and prefetching. In ICS, 2002.
[21] A. J. Smith. Cache memories. Computing Surveys, 14(4):473–530, 1982.
[22] L. Spracklen and S. G. Abraham. Chip multithreading: Opportunities and challenges. In HPCA-11, 2005.
[23] V. Srinivasan, G. S. Tyson, and E. S. Davidson. A static filter for reducing prefetch traffic. Technical Report CSE-TR-400-99, University of Michigan, 1999.
[24] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Technical White Paper, Oct. 2001.
[25] X. Zhuang and H.-H. S. Lee. A hardware-based cache pollution filtering mechanism for aggressive prefetches. In ICPP-32, 2003.
