Improving The Performance and Bandwidth Efficiency

[...] predicted memory addresses are not accurate. The aggressiveness of the prefetcher determines how far it stays ahead of the demand access stream of the program as well as how many prefetch requests are generated, as shown in Table 1 and Section 2.1.

• First, prefetching can increase the contention for the available memory bandwidth. Additional bandwidth consumption [...]

3 Similar results were reported by [8] and [18]. All average IPC results in this paper are computed as the geometric mean of the IPCs of the benchmarks.
As Figure 1 shows, very aggressive prefetching on average performs better than conservative and middle-of-the-road prefetching. Unfortunately, aggressive prefetching significantly reduces performance on some benchmarks. For example, an aggressive prefetcher reduces the IPC performance of ammp by 48% and applu by 29% compared to no prefetching. Hence, blindly increasing the aggressiveness of the hardware prefetcher can drastically reduce performance on several applications even though it improves the average performance of a processor. Since aggressive prefetching significantly degrades performance on some benchmarks, many modern processors employ relatively conservative prefetching mechanisms where the prefetcher does not stay far ahead of the demand access stream of the program [6, 24].

[Figure 1. Performance vs. aggressiveness of the prefetcher: IPC (0.00-5.00) of the No prefetching, Very Conservative, Middle-of-the-Road, and Very Aggressive configurations across the SPEC CPU2000 benchmarks.]

The goal of this paper is to reduce the negative performance and bandwidth impact of aggressive prefetching while preserving the large performance benefits provided by it. To achieve this goal, we propose simple and implementable mechanisms that dynamically adjust the aggressiveness of the hardware prefetcher as well as the location in the processor cache where prefetched data is inserted.

The proposed mechanisms estimate the effectiveness of the prefetcher by monitoring the accuracy and timeliness of the prefetch requests as well as the cache pollution caused by the prefetch requests. We describe simple hardware implementations to estimate accuracy, timeliness, and cache pollution. Based on the run-time estimation of these three metrics, the aggressiveness of the hardware prefetcher is decreased or increased dynamically. Also, based on the run-time estimation of the cache pollution caused by the prefetcher, the proposed mechanism dynamically decides where to insert the prefetched blocks in the processor cache's LRU stack.

Our results show that the proposed dynamic feedback mechanisms improve the average performance of 17 memory-intensive benchmarks in the SPEC CPU2000 suite by 6.5% compared to the best-performing conventional stream-based prefetcher configuration. With the proposed mechanism, the negative performance impact incurred on some benchmarks due to stream-based prefetching is completely eliminated. Furthermore, the proposed mechanism consumes 18.7% less memory bandwidth than the best-performing stream-based prefetcher configuration. Compared to a conventional stream-based prefetcher configuration that consumes a similar amount of memory bandwidth, feedback directed prefetching provides 13.6% higher performance. We also show that the dynamic feedback mechanism works similarly well when implemented to dynamically adjust the aggressiveness of a global-history-buffer (GHB) based delta correlation prefetcher [10] or a PC-based stride prefetcher [1]. Compared to a conventional GHB-based delta correlation prefetcher configuration that consumes a similar amount of memory bandwidth, feedback directed prefetching provides 9.9% higher performance. The proposed mechanism provides these benefits with a modest hardware storage cost of 2.54 KB and without significantly increasing hardware complexity. On the remaining 9 SPEC CPU2000 benchmarks, the proposed dynamic feedback mechanism performs as well as the best-performing conventional stream prefetcher configuration for those 9 benchmarks.

2. Background and Motivation

2.1. Stream Prefetcher Design

The stream prefetcher we model is based on the stream prefetcher in the IBM POWER4 processor [24]; more details on the implementation of stream-based prefetching can be found in [11, 19, 24]. The modeled prefetcher brings cache blocks from main memory into the last-level cache, which is the second-level (L2) cache in our baseline processor.

The stream prefetcher is able to keep track of multiple different access streams. For each tracked access stream, a stream tracking entry is created in the stream prefetcher. Each tracking entry can be in one of four different states:

1. Invalid: The tracking entry is not allocated a stream to keep track of. Initially, all tracking entries are in this state.

2. Allocated: A demand (i.e. load/store) L2 miss allocates a tracking entry if the demand miss does not find any existing tracking entry for its cache-block address.

3. Training: The prefetcher trains the direction (ascending or descending) of the stream based on the next two L2 misses that occur within +/- 16 cache blocks of the first miss.4 If the next two accesses in the stream are to ascending (descending) addresses, the direction of the tracking entry is set to 1 (0) and the entry transitions to the Monitor and Request state.

4. Monitor and Request: The tracking entry monitors the accesses to a memory region from a start pointer (address A) to an end pointer (address P). The maximum distance between the start pointer and the end pointer is determined by the Prefetch Distance, which indicates how far ahead of the demand access stream the prefetcher can send requests. If there is a demand L2 cache access to a cache block in the monitored memory region, the prefetcher requests cache blocks [P+1, ..., P+N] as prefetch requests (assuming the direction of the tracking entry is set to 1). N is called the Prefetch Degree. After sending the prefetch requests, the tracking entry starts monitoring the memory region between addresses A+N and P+N (i.e. effectively it moves the tracked memory region forward by N cache blocks).5

4 Note that all addresses tracked by the prefetcher are cache-block addresses.

5 Right after a tracking entry is trained, the prefetcher sets the start pointer to the first L2 miss address that allocated the tracking entry and the end pointer to the last L2 miss address that determined the direction of the entry plus an initial start-up distance. Until the monitored memory region's size becomes the same as the Prefetch Distance (in terms of cache blocks), the tracking entry increments only the end pointer by the Prefetch Degree when prefetches are issued (i.e. the end pointer points to the last address requested as a prefetch and the start pointer points to the L2 miss address that allocated the tracking entry). After the monitored memory region's size becomes the same as the Prefetch Distance, both the start pointer and the end pointer are incremented by the Prefetch Degree (N) when prefetches are issued. This way, the prefetcher is able to send prefetch requests that are Prefetch Distance ahead of the demand access stream.
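To make the Monitor and Request behavior concrete, the sketch below shows one way the tracking-entry update could look in software. It is a minimal illustration of the mechanism as described above, not the actual hardware: the names (stream_entry, issue_prefetch, monitor_and_request) and the signed direction field are our assumptions, while the pointer-advancement rules follow footnote 5.

```c
#include <stdint.h>
#include <stdbool.h>

enum entry_state { INVALID, ALLOCATED, TRAINING, MONITOR_AND_REQUEST };

typedef struct {
    enum entry_state state;
    uint64_t start;   /* start pointer A (cache-block address) */
    uint64_t end;     /* end pointer P (cache-block address)   */
    int      dir;     /* +1 for ascending, -1 for descending   */
} stream_entry;

/* Hypothetical hook that hands a prefetch request to the memory system. */
extern void issue_prefetch(uint64_t block_addr);

/* Called on a demand L2 access to cache-block address 'addr'.
 * 'distance' and 'degree' are the current aggressiveness parameters. */
void monitor_and_request(stream_entry *e, uint64_t addr,
                         unsigned distance, unsigned degree)
{
    if (e->state != MONITOR_AND_REQUEST)
        return;

    /* Is the access inside the monitored region [A, P]? */
    bool hit = (e->dir > 0) ? (addr >= e->start && addr <= e->end)
                            : (addr <= e->start && addr >= e->end);
    if (!hit)
        return;

    /* Request blocks P+1 ... P+N in the stream direction. */
    for (unsigned i = 1; i <= degree; i++)
        issue_prefetch(e->end + (int64_t)e->dir * (int64_t)i);

    /* Advance the end pointer by N; the start pointer follows only once
     * the region has grown to the Prefetch Distance (see footnote 5). */
    e->end += (int64_t)e->dir * (int64_t)degree;
    uint64_t size = (e->dir > 0) ? e->end - e->start : e->start - e->end;
    if (size >= distance)
        e->start += (int64_t)e->dir * (int64_t)degree;
}
```

With distance=64 and degree=4 (the Very Aggressive configuration of Table 1), each in-region demand access issues four prefetches and slides the 64-block monitored window forward by four blocks.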
Prefetch Distance and Prefetch Degree determine the aggressiveness of the prefetcher. In a traditional prefetcher configuration, the values of Prefetch Distance and Prefetch Degree are fixed at the design time of the processor. In the feedback directed mechanism we propose, the processor dynamically changes Prefetch Distance and Prefetch Degree to adjust the aggressiveness of the prefetcher.

2.2. Metrics of Prefetcher Effectiveness

We use three metrics (Prefetch Accuracy, Prefetch Lateness, and Prefetcher-Generated Cache Pollution) as feedback inputs to feedback directed prefetchers. In this section, we define the metrics and describe the relationship between the metrics and the performance provided by a conventional prefetcher. We evaluate four configurations: No prefetching, Very Conservative prefetching (distance=4, degree=1), Middle-of-the-Road prefetching (distance=16, degree=2), and Very Aggressive prefetching (distance=64, degree=4).

2.2.1. Prefetch Accuracy: Prefetch accuracy is a measure of how accurately the prefetcher can predict the memory addresses that will be accessed by the program. It is defined as

\[
\text{Prefetch Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}}
\]

where a useful prefetch is a prefetched cache block that is used by a demand request while it resides in the cache.

2.2.2. Prefetch Lateness: [...] In vortex, prefetch lateness decreases from 70% to 22% when a very aggressive prefetcher is used instead of a very conservative one. Aggressive prefetching reduces the lateness of prefetches because an aggressive prefetcher generates prefetch requests earlier than a conservative one would.

2.2.3. Prefetcher-Generated Cache Pollution: Prefetcher-generated cache pollution is a measure of the disturbance caused by prefetched data in the L2 cache. It is defined as:

\[
\text{Prefetcher-Generated Cache Pollution} = \frac{\text{Number of Demand Misses Caused by the Prefetcher}}{\text{Number of Demand Misses}}
\]

A demand miss is defined to be caused by the prefetcher if it would not have occurred had the prefetcher not been present. If the prefetcher-generated cache pollution is high, the performance of the processor can degrade because useful data in the cache could be evicted by prefetched data. Furthermore, high cache pollution can also result in higher memory bandwidth consumption by requiring the re-fetch of the displaced data from main memory.

6 [...] next-sequential prefetching [5, 21] already employ pref-bits in the cache.
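Since the definitions above reduce to ratios of event counts, a small sketch can show how the three feedback metrics would be computed from interval counters. This is our own illustration: the counter and function names are assumptions, and, because the defining text of Section 2.2.2 is lost above, the lateness ratio (late useful prefetches over useful prefetches) is an assumed reading rather than a quote.

```c
/* Feedback event counts collected over one sampling interval. */
typedef struct {
    unsigned pref_sent;     /* prefetches sent to memory               */
    unsigned pref_used;     /* prefetched blocks used by demand access */
    unsigned pref_late;     /* useful prefetches that arrived too late */
    unsigned pollution;     /* demand misses caused by the prefetcher  */
    unsigned demand_miss;   /* all demand misses                       */
} fdp_counters;

/* Ratios in [0,1]; guard against empty intervals. */
static double ratio(unsigned num, unsigned den)
{
    return den ? (double)num / (double)den : 0.0;
}

double prefetch_accuracy(const fdp_counters *c)
{
    return ratio(c->pref_used, c->pref_sent);
}

double prefetch_lateness(const fdp_counters *c)   /* assumed definition */
{
    return ratio(c->pref_late, c->pref_used);
}

double cache_pollution(const fdp_counters *c)     /* Section 2.2.3 */
{
    return ratio(c->pollution, c->demand_miss);
}
```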
[Figure 2. IPC performance (left) and prefetch accuracy (right) with different aggressiveness configurations (No prefetching, Very Conservative, Middle-of-the-Road, Very Aggressive) across the SPEC CPU2000 benchmarks.]
[Figure 3. IPC performance (left) and prefetch lateness (right) with different aggressiveness configurations (No prefetching, Very Conservative, Middle-of-the-Road, Very Aggressive).]
[...] the processor needs to store information about all demand-fetched L2 cache blocks dislodged by the prefetcher. However, such a mechanism is impractical as it incurs a heavy overhead in terms of both hardware and complexity. We use the Bloom filter concept [2, 20] to provide a simple cost-effective hardware mechanism that can approximate the number of demand misses caused by the prefetcher.

[Figure 4. Filter to estimate prefetcher-generated cache pollution: a bit-vector (the pollution filter) indexed by CacheBlockAddress[11:0] XOR CacheBlockAddress[23:12].]

Figure 4 shows the filter that is used to approximate the number of L2 demand misses caused by the prefetcher. The filter consists of a bit-vector, which is indexed with the output of the exclusive-or operation of the lower and higher order bits of the cache block address. When a block that was brought into the cache due to a demand miss is evicted from the cache due to a prefetch request, the filter is accessed with the address of the evicted cache block and the corresponding bit in the filter is set (indicating that the evicted cache block was evicted due to a prefetch request). When a prefetch request is serviced from memory, the pollution filter is accessed with the cache-block address of the prefetch request and the corresponding bit in the filter is reset, indicating that the block was inserted into the cache. When a demand access misses in the cache, the filter is accessed using the cache-block address of the demand request. If the corresponding bit in the filter is set, it is an indication that the demand miss was caused by the prefetcher. In such cases, the hardware counter, pollution-total, that keeps track of the total number of demand misses caused by the prefetcher is incremented. Another counter, demand-total, keeps track of the total number of demand misses generated by the processor and is incremented for each demand miss. Cache pollution caused by the prefetcher can be computed by taking the ratio of pollution-total to demand-total. We use a 4096-entry bit vector in our experiments.

3.2. Sampling-based Feedback Collection

To adapt to the time-varying memory phase behavior of a program, we use interval-based sampling for all counters described in Section 3.1. Program execution time is divided into intervals and the value of each counter is computed as:

\[
\text{CounterValue} = \frac{1}{2}\,\text{CounterValueAtTheBeginningOfTheInterval} + \frac{1}{2}\,\text{CounterValueDuringInterval} \qquad (1)
\]

The CounterValueDuringInterval is reset at the end of each sampling interval. The equation used to update the counters (Equation 1) gives more weight to the behavior of the program in the most recent interval while taking into account the behavior in all previous intervals. Our mechanism defines the length of an interval based on the number of useful cache blocks evicted from the L2 cache.7 A hardware counter, eviction-count, keeps track of the number of blocks evicted from the L2 cache. When the value of the counter exceeds a statically-set threshold T_interval, the interval ends. At the end of an interval, all counters described in Section 3.1 are updated according to Equation 1.

7 There are other ways to define the length of an interval, e.g. based on the number of instructions executed. We use the number of useful cache blocks evicted to define an interval because this metric provides a more accurate view of the memory behavior of a program than the number of instructions executed.
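A minimal software sketch of the pollution filter and the end-of-interval counter update is given below. The 4096-entry size, the XOR indexing of low and high address bits, the pollution-total/demand-total counters, and the Equation-1 halving follow the text; the function names, the byte-per-bit array (real hardware would pack the bits), and the exact bit fields are our assumptions.

```c
#include <stdint.h>

#define FILTER_ENTRIES 4096            /* 4096-entry bit vector */

static uint8_t  pollution_filter[FILTER_ENTRIES]; /* 1 logical bit per entry */
static uint32_t pollution_total, demand_total;    /* feedback counters       */

/* Index = CacheBlockAddress[11:0] XOR CacheBlockAddress[23:12]. */
static unsigned filter_index(uint64_t block_addr)
{
    return (unsigned)((block_addr ^ (block_addr >> 12)) & 0xFFF);
}

/* A demand-fetched block was evicted by a prefetch: set the bit. */
void on_demand_block_evicted_by_prefetch(uint64_t block_addr)
{
    pollution_filter[filter_index(block_addr)] = 1;
}

/* A prefetch was serviced from memory: reset the bit. */
void on_prefetch_fill(uint64_t block_addr)
{
    pollution_filter[filter_index(block_addr)] = 0;
}

/* A demand access missed: count it, and count it as prefetcher-caused
 * if the filter bit for its block address is set. */
void on_demand_miss(uint64_t block_addr)
{
    demand_total++;
    if (pollution_filter[filter_index(block_addr)])
        pollution_total++;
}

/* End-of-interval update per Equation 1: new = old/2 + during/2. */
uint32_t interval_update(uint32_t value_at_interval_begin,
                         uint32_t value_during_interval)
{
    return value_at_interval_begin / 2 + value_during_interval / 2;
}
```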
The updated counter values are then used to compute the three metrics: accuracy, lateness, and pollution. These metrics are used to adjust the prefetcher behavior for the next interval. The eviction-count register is reset and a new interval begins. In our experiments, we use a value of 8192 (half the number of blocks in the L2 cache) for T_interval.

3.3. Dynamically Adjusting Prefetcher Behavior

At the end of each sampling interval, the computed values of the accuracy, lateness, and pollution metrics are used to dynamically adjust prefetcher behavior. Prefetcher behavior is adjusted in two ways: (1) by adjusting the aggressiveness of the prefetching mechanism, and (2) by adjusting the location in the L2 cache's LRU stack where prefetched blocks are inserted.8

3.3.1. Adjusting Prefetcher Aggressiveness: The aggressiveness of the prefetcher directly determines the potential for benefit as well as harm that is caused by the prefetcher. By dynamically adapting this parameter based on the collected feedback information, the processor can not only achieve the performance benefits of aggressive prefetching during program phases where aggressive prefetching performs well but also eliminate the negative performance and bandwidth impact of aggressive prefetching during phases where aggressive prefetching performs poorly.

As shown in Table 1, our baseline stream prefetcher has five different configurations ranging from Very Conservative to Very Aggressive. The aggressiveness of the stream prefetcher is determined by the Dynamic Configuration Counter, a 3-bit saturating counter that saturates at values 1 and 5. The initial value of the Dynamic Configuration Counter is set to 3, indicating Middle-of-the-Road aggressiveness.

Dyn. Config. Counter | Aggressiveness | Pref. Distance | Pref. Degree
1 | Very Conservative | 4 | 1
2 | Conservative | 8 | 1
3 | Middle-of-the-Road | 16 | 2
4 | Aggressive | 32 | 4
5 | Very Aggressive | 64 | 4
Table 1. Stream prefetcher configurations

At the end of each sampling interval, the value of the Dynamic Configuration Counter is updated based on the computed values of the accuracy, lateness, and pollution metrics. The computed accuracy is compared to two thresholds (A_high and A_low) and is classified as high, medium, or low. Similarly, the computed lateness is compared to a single threshold (T_lateness) and is classified as either late or not-late. Finally, the computed pollution is compared to a single threshold (T_pollution) and is classified as high (polluting) or low (not-polluting). We use static thresholds in our mechanisms. The effectiveness of our mechanism can be improved by dynamically tuning the values of these thresholds and/or using more thresholds, but such optimization is out of the scope of this paper. In Section 5, we show that even with untuned threshold values, FDP can significantly improve performance and reduce memory bandwidth consumption on different data prefetchers.

Table 2 shows in detail how the estimated values of the three metrics are used to adjust the dynamic configuration of the prefetcher. We determined the counter update choice for each case empirically. If the prefetches are causing pollution (all even-numbered cases), the prefetcher is adjusted to be less aggressive to reduce cache pollution and to save memory bandwidth (except in Case 2, when the accuracy is high and prefetches are late; we do increase aggressiveness in this case to gain more benefit from highly-accurate prefetches). If the prefetches are late but not polluting (Cases 1, 5, 9), the aggressiveness is increased to increase timeliness unless the prefetch accuracy is low (Case 9; we reduce aggressiveness in this case because a large fraction of inaccurate prefetches will waste memory bandwidth). If the prefetches are neither late nor polluting (Cases 3, 7, 11), the aggressiveness is left unchanged.

3.3.2. Adjusting Cache Insertion Policy of Prefetched Blocks: FDP also adjusts the location in which a prefetched block is inserted in the LRU-stack of the corresponding cache set based on the observed behavior of the prefetcher. In many cache implementations, prefetched cache blocks are simply inserted into the Most-Recently-Used (MRU) position in the LRU-stack, since such an insertion policy does not require any changes to the cache implementation. Inserting the prefetched blocks into the MRU position can allow the prefetcher to be more aggressive and request data long before its use because this insertion policy allows the useful prefetched blocks to stay longer in the cache. However, if the prefetched cache blocks create cache pollution, having a different cache insertion policy for prefetched cache blocks can help reduce the cache pollution caused by the prefetcher. A prefetched block that is not useful creates more pollution in the cache if it is inserted into the MRU position rather than a less recently used position because it stays in the cache for a longer time period, occupying cache space that could otherwise be allocated to a useful demand-fetched cache block. Therefore, if the prefetch requests are causing cache pollution, it would be desirable to reduce this pollution by changing the location in the LRU stack in which prefetched blocks are inserted.

We propose a simple heuristic that decides where in the LRU stack of the L2 cache set a prefetched cache block is inserted based on the estimated prefetcher-generated cache pollution. At the end of a sampling interval, the estimated cache pollution metric is compared to two thresholds (P_low and P_high) to determine whether the pollution caused by the prefetcher was low, medium, or high. If the pollution caused by the prefetcher was low, the prefetched cache blocks are inserted into the middle (MID) position in the LRU stack during the next sampling interval (for an n-way set-associative cache, we define the MID position in the LRU stack as the floor(n/2)th least-recently-used position).9 On the other hand, if the pollution caused by the prefetcher was medium, prefetched cache blocks are inserted into the LRU-4 position in the LRU stack (for an n-way set-associative cache, we define the LRU-4 position in the LRU stack as the floor(n/4)th least-recently-used position). Finally, if the pollution caused by the prefetcher was high, prefetched cache blocks are inserted into the LRU position during the next sampling interval.

8 Note that we adjust prefetcher behavior on a global (across-streams) basis rather than on a per-stream basis, as we did not find much benefit in adjusting on a per-stream basis.

9 We found that inserting prefetched blocks into the MRU position does not provide significant benefits over inserting them into the MID position. Thus, our dynamic mechanism does not insert prefetched blocks into the MRU position. For a detailed analysis of the cache insertion policy, see Section 5.2.
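The insertion heuristic maps the estimated pollution level to a position in the set's LRU stack. The sketch below spells that mapping out; the classification helper and all names are ours, while the MID = floor(n/2) and LRU-4 = floor(n/4) positions follow the text.

```c
typedef enum { POLLUTION_LOW, POLLUTION_MEDIUM, POLLUTION_HIGH } pollution_level;

/* Classify the interval's pollution estimate against the two static
 * thresholds P_low and P_high (concrete values are not given here). */
pollution_level classify_pollution(double pollution, double p_low, double p_high)
{
    if (pollution < p_low)  return POLLUTION_LOW;
    if (pollution < p_high) return POLLUTION_MEDIUM;
    return POLLUTION_HIGH;
}

/* LRU-stack insertion position for prefetched blocks in an n-way set.
 * Position 0 is the LRU slot; position n-1 would be MRU. */
int prefetch_insert_position(pollution_level level, int n_ways)
{
    switch (level) {
    case POLLUTION_LOW:    return n_ways / 2; /* MID:   floor(n/2)th LRU pos */
    case POLLUTION_MEDIUM: return n_ways / 4; /* LRU-4: floor(n/4)th LRU pos */
    default:               return 0;          /* high pollution: LRU pos     */
    }
}
```

For the 16-way baseline L2 cache, this yields insertion positions 8 (MID), 4 (LRU-4), and 0 (LRU).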
Case | Prefetch Accuracy | Prefetch Lateness | Cache Pollution | Dynamic Configuration Counter Update (reason)
1 | High | Late | Not-Polluting | Increment (to increase timeliness)
2 | High | Late | Polluting | Increment (to increase timeliness)
3 | High | Not-Late | Not-Polluting | No Change (best case configuration)
4 | High | Not-Late | Polluting | Decrement (to reduce pollution)
5 | Medium | Late | Not-Polluting | Increment (to increase timeliness)
6 | Medium | Late | Polluting | Decrement (to reduce pollution)
7 | Medium | Not-Late | Not-Polluting | No Change (to keep the benefits of timely prefetches)
8 | Medium | Not-Late | Polluting | Decrement (to reduce pollution)
9 | Low | Late | Not-Polluting | Decrement (to save bandwidth)
10 | Low | Late | Polluting | Decrement (to reduce pollution)
11 | Low | Not-Late | Not-Polluting | No Change (to keep the benefits of timely prefetches)
12 | Low | Not-Late | Polluting | Decrement (to reduce pollution and save bandwidth)
Table 2. How to adapt? Use of the three metrics to adjust the aggressiveness of the prefetcher
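The twelve cases of Table 2 collapse to three rules, which the sketch below encodes together with the 3-bit saturating Dynamic Configuration Counter and the Table 1 configurations. It is an illustrative restatement of the two tables, with C names of our choosing.

```c
#include <stdbool.h>

typedef enum { ACC_LOW, ACC_MEDIUM, ACC_HIGH } accuracy_t;

/* Table 1 configurations indexed by the Dynamic Configuration Counter. */
static const struct { int distance, degree; } config[6] = {
    {0, 0},   /* index 0 unused: the counter saturates at 1 */
    {4, 1},   /* 1: Very Conservative   */
    {8, 1},   /* 2: Conservative        */
    {16, 2},  /* 3: Middle-of-the-Road  */
    {32, 4},  /* 4: Aggressive          */
    {64, 4},  /* 5: Very Aggressive     */
};

static int dyn_config = 3;  /* initial value: Middle-of-the-Road */

/* End-of-interval update of the Dynamic Configuration Counter (Table 2). */
void update_aggressiveness(accuracy_t acc, bool late, bool polluting)
{
    int delta;
    if (late && !polluting)                           /* Cases 1, 5, 9  */
        delta = (acc == ACC_LOW) ? -1 : +1;           /* Case 9: save bandwidth */
    else if (polluting)                               /* even-numbered cases */
        delta = (acc == ACC_HIGH && late) ? +1 : -1;  /* only Case 2 increments */
    else                                              /* Cases 3, 7, 11 */
        delta = 0;

    dyn_config += delta;                              /* saturate at 1 and 5 */
    if (dyn_config < 1) dyn_config = 1;
    if (dyn_config > 5) dyn_config = 5;
}

/* The prefetcher runs the next interval with these parameters. */
int current_distance(void) { return config[dyn_config].distance; }
int current_degree(void)   { return config[dyn_config].degree;   }
```

One can check the logic against the table: Case 2 (High, Late, Polluting) takes the polluting branch with high accuracy and lateness and increments, while Case 9 (Low, Late, Not-Polluting) decrements to save bandwidth.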
Pipeline | 20-cycle minimum branch misprediction penalty; 4 GHz processor
Branch Predictor | aggressive hybrid branch predictor (64K-entry gshare, 64K-entry per-address w/ 64K-entry selector); wrong-path execution faithfully modeled
Instruction Window | 128-entry reorder buffer; 128-entry INT and 128-entry FP physical register files; 64-entry store buffer
Execution Core | 8-wide, fully pipelined except for FP divide; full bypass network
On-chip Caches | 64KB instruction cache with 2-cycle latency; 64KB, 4-way L1 data cache with 8 banks and 2-cycle latency, allows 4 load accesses per cycle; 1MB, 16-way, unified L2 cache with 8 banks and 10-cycle latency, 128 L2 MSHRs, 1 L2 read port, 1 L2 write port; all caches use LRU replacement and have 64B block size
Buses and Memory | 500-cycle minimum main memory latency; 32 DRAM banks; 32B-wide, split-transaction core-to-memory bus at 4:1 frequency ratio; 4.5 GB/s bus bandwidth; max. 128 outstanding misses to main memory; bank conflicts, bandwidth, port contention, and queueing delays faithfully modeled
Table 3. Baseline processor configuration
bzip2 | crafty | eon | gap | gcc | gzip | mcf | parser | perlbmk | twolf | vortex | vpr
336K | 59K | 4969 | 1656K | 110K | 31K | 2585K | 515K | 9218 | 2749 | 591K | 246K
ammp | applu | apsi | art | equake | facerec | fma3d | galgel | lucas | mesa | mgrid | sixtrack | swim | wupwise
1157K | 6038K | 8656 | 13319K | 2414K | 2437K | 3643 | 243K | 1103 | 273K | 2185K | 292K | 8766K | 799K
Table 4. Number of prefetches sent by a very aggressive stream prefetcher for each benchmark in the SPEC CPU2000 suite
[...] is able to detect and employ the best-performing aggressiveness level for the stream prefetcher on a per-benchmark basis.

[Figure 5. Dynamic adjustment of prefetcher aggressiveness: IPC of No prefetching, Very Conservative, Middle-of-the-Road, Very Aggressive, and Dynamic Aggressiveness configurations.]

5.1.1. Adapting to the Program: Figure 6 shows the distribution of the value of the Dynamic Configuration Counter over the sampling intervals. [...] the most conservative configuration (counter value of 1) for most of the sampling intervals. For [...] in more than 98% of the intervals for both applu and ammp. [...] applu and only a 5.9% performance loss on ammp compared to no prefetching, similar to the best-performing traditional prefetcher configuration for the two benchmarks.

[Figure 6. Distribution of the dynamic aggressiveness level, from Very Conservative (1) to Very Aggressive (5), as a percentage of sampling intervals.]

5.2. Adjusting Cache Insertion Policy of Prefetches

Figure 7 shows the performance of dynamically adjusting the cache insertion policy (i.e. Dynamic Insertion) using the Very Aggressive prefetcher configuration.

[Figure 7. IPC of static insertion policies (LRU, LRU-4, MID, MRU) and Dynamic Insertion.]
[...] in the LRU position causes an aggressive prefetcher to evict prefetched blocks before they get used by demand loads/stores. However, inserting in the LRU position eliminates the per[...]

[...] higher performance than any of the static insertion policies. Dynamic Insertion achieves 5.1% better performance than inserting prefetched blocks into the MRU position and 1.9% better performance than inserting them into the LRU-4 position. Furthermore, Dynamic Insertion almost always provides the performance of the best static insertion policy for each benchmark. Hence, dynamically adapting the prefetch insertion policy using run-time estimates of prefetcher-generated cache pollution is able to detect and employ the best-performing cache insertion policy for the stream prefetcher on a per-benchmark basis.

Figure 8 shows the distribution of the insertion position of the prefetched blocks when Dynamic Insertion is used. For benchmarks where a static policy of inserting prefetched blocks into the LRU position provides the best performance across all static configurations (applu, galgel, ammp), Dynamic Insertion places most (more than 50%) of the prefetched blocks into the LRU position. Therefore, Dynamic Insertion improves the performance of these benchmarks by dynamically employing the best-performing insertion policy.

[Figure 8. Distribution of the insertion position (percentage of prefetch insertions) of prefetched blocks under Dynamic Insertion.]

[...] (5) Dynamic Aggressiveness and Dynamic Insertion together. Using Dynamic Aggressiveness and Dynamic Insertion together provides the best performance across all configurations, improving the IPC by 6.5% over the best-performing traditional prefetcher configuration (i.e. the Very Aggressive configuration). This performance improvement is greater than the performance improvement provided by Dynamic Aggressiveness or Dynamic Insertion alone. Hence, dynamically adjusting both aspects of prefetcher behavior (aggressiveness and insertion policy) provides complementary performance benefits.

[Figure 9. Overall performance of FDP: IPC of No prefetching, Very Aggressive, Dynamic Insertion, Dynamic Aggressiveness, and both combined.]

With the use of FDP to dynamically adjust both aspects of prefetcher behavior, the performance loss incurred on some benchmarks due to aggressive prefetching is completely eliminated. No benchmark loses performance compared to no prefetching if both Dynamic Aggressiveness and Dynamic Insertion are used. In fact, FDP improves the performance of applu by 13.4% and ammp by 11.4% over no prefetching, two benchmarks that otherwise incur very significant performance losses with an aggressive traditional prefetcher configuration.

5.4. Impact of FDP on Bandwidth Consumption

Aggressive prefetching can adversely affect the bandwidth consumption in the memory system when prefetches are not used or when they cause cache pollution. Figure 10 shows [...]

11 [...] bandwidth metric, because this metric includes the effect of L2 misses caused due to demand accesses as well as prefetches. If the prefetcher is polluting the cache, then the number of L2 misses due to demand accesses also increases. Hence, counting the number of bus accesses provides a more accurate measure of the memory bandwidth consumed by the prefetcher.

12 The Middle-of-the-Road configuration consumes only 2.5% less memory bandwidth than FDP.
[Figure 10. Memory bandwidth consumption (BPKI) of conventional prefetcher configurations and FDP.]

[Figure 11. Performance of prefetch cache vs. FDP.]

 | No pref. | Very Cons. | Middle | Very Aggr. | FDP
IPC | 0.85 | 1.21 | 1.47 | 1.57 | 1.67
BPKI | 8.56 | 9.34 | 10.60 | 13.38 | 10.88
Table 5. Average IPC and BPKI for FDP vs. conventional prefetchers
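Table 5 and the sensitivity results report bandwidth in BPKI. The surviving text does not spell the term out, so the formula below is our assumed reading, consistent with footnote 11's note that memory-bus accesses are what is counted: bus accesses per thousand retired instructions.

\[
\text{BPKI} = \frac{\text{Number of memory-bus accesses}}{\text{Number of retired instructions}} \times 1000
\]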
5.5. Hardware Cost and Complexity of FDP

Table 6 shows the hardware cost of the proposed mechanism in terms of the required state. FDP does not add significant combinational logic complexity to the processor. Combinational logic is required for the update of counters, update of the pref-bits in the L2 cache, update of the entries in the pollution filter, calculation of feedback metrics at the end of each sampling interval, determination of when a sampling interval ends, and insertion of prefetched blocks into appropriate locations in the LRU stack of an L2 cache set. None of the required logic is on the critical path of the processor. The storage overhead of our mechanism is less than 0.25% of the data-store size of the baseline 1MB L2 cache.

pref-bit for each tag-store entry in the L2 cache | 16384 blocks * 1 bit/block = 16384 bits
Pollution Filter | 4096 entries * 1 bit/entry = 4096 bits
16-bit counters used to estimate feedback metrics | 11 counters * 16 bits/counter = 176 bits
pref-bit for each MSHR entry | 128 entries * 1 bit/entry = 128 bits
Total hardware cost | 20784 bits = 2.54 KB
Percentage area overhead compared to baseline 1MB L2 cache | 2.5KB/1024KB = 0.24%
Table 6. Hardware cost of feedback directed prefetching

5.6. Using only Prefetch Accuracy for Feedback

We use a comprehensive set of metrics (prefetch accuracy, timeliness, and pollution) in order to provide feedback to adjust the prefetcher aggressiveness. In order to assess the benefit of using timeliness as well as cache pollution, we evaluated a mechanism where we adapted the prefetcher aggressiveness based only on accuracy. In such a scheme, we increment the Dynamic Configuration Counter if the accuracy is high and decrement it if the accuracy is low. We found that, compared to this scheme that only uses accuracy to throttle the aggressiveness of a stream prefetcher, our comprehensive mechanism that also takes into account timeliness and cache pollution provides 3.4% higher performance and consumes 2.5% less bandwidth.

5.7. FDP vs. Using a Prefetch Cache

Cache pollution caused by prefetches can be eliminated by bringing prefetched data into separate prefetch buffers [13, 11] rather than inserting prefetched data into the L2 cache. Figures 11 and 12 respectively show the performance and bandwidth consumption of the Very Aggressive prefetcher with different prefetch cache sizes, ranging from a 2KB fully-associative prefetch cache to a 1MB 16-way prefetch cache.13 The per[...]

[Figure 12. Bandwidth consumption (BPKI) of prefetch caches (No prefetching; Very Aggressive (base); 2KB fully-associative; 8KB, 32KB, 64KB, and 1MB 16-way; Dyn. Aggr. + Dyn. Ins.) vs. FDP.]

The results show that using small (2KB and 8KB) prefetch caches does not provide as high performance as inserting the prefetched data into the L2 cache. With an aggressive prefetcher and a small prefetch cache, the prefetched blocks are displaced by later prefetches before being used by the program, which results in performance degradation with a small prefetch cache. However, larger prefetch caches (32KB and larger) improve performance compared to inserting prefetched data into the L2 cache because a larger prefetch cache reduces the pollution caused by prefetched data in the L2 cache while providing enough space for prefetched blocks.

Using FDP (both Dynamic Aggressiveness and Dynamic Insertion) to prefetch into the L2 cache provides 5.3% higher performance than that provided by augmenting the Very Aggressive traditional prefetcher configuration with a 32KB prefetch cache. The performance of FDP is also within 2% of the performance of the Very Aggressive configuration with a 64KB prefetch cache. Furthermore, the memory bandwidth consumption of FDP is 16% and 9% less than the Very Aggressive prefetcher configurations with respectively a 32KB and a 64KB prefetch cache. Hence, FDP achieves the performance provided by a relatively large prefetch cache bandwidth-efficiently and without requiring as large a hardware cost and complexity as that introduced by the addition of a prefetch cache larger than 32KB.

13 In the configurations with a prefetch cache, a prefetched cache block is moved from the prefetch cache into the L2 cache if it is accessed by a demand load/store request. The block size of the prefetch cache and the L2 cache are the same, and the prefetch cache is assumed to be accessed in parallel with the L2 cache without any adverse latency impact on L2 cache access time.
5.8. Effect on a Global History Buffer Prefetcher

We have also implemented FDP on the C/DC (C-Zone Delta Correlation) variant of the Global History Buffer (GHB) prefetcher [10]. In order to vary the aggressiveness of this prefetcher dynamically, we vary the Prefetch Degree.14 Below, we show the aggressiveness configurations used for the GHB prefetcher. FDP adjusts the configuration of the GHB prefetcher as described in Section 3.3.

Dyn. Config. Counter | Aggressiveness | Prefetch Degree
1 | Very Conservative | 4
2 | Conservative | 8
3 | Middle-of-the-Road | 16
4 | Aggressive | 32
5 | Very Aggressive | 64

Figure 13 shows the performance and bandwidth consumption of different GHB prefetcher configurations and the feedback directed GHB prefetcher using both Dynamic Aggressiveness and Dynamic Insertion. The feedback directed GHB prefetcher performs similarly to the best-performing traditional configuration (the Very Aggressive configuration), while it consumes 20.8% less memory bandwidth. Compared to the traditional GHB prefetcher configuration that consumes a similar amount of memory bandwidth as FDP (i.e. the Middle-of-the-Road configuration), FDP provides 9.9% higher performance. Hence, FDP significantly increases the bandwidth-efficiency of GHB-based delta correlation prefetching. Note that it is possible to improve the performance and bandwidth benefits of the proposed mechanism by tuning the thresholds used in the feedback mechanisms to the behavior of the GHB-based prefetcher, but we did not pursue this option.

5.9. Effect of FDP on a PC-Based Stride Prefetcher

We also evaluated FDP on a PC-based stride prefetcher [1] and found that the results are similar to those achieved on both stream and GHB-based prefetchers. On average, using the feedback directed approach results in a 4% performance gain and a 24% reduction in memory bandwidth compared to the best-performing conventional configuration for a PC-based stride prefetcher. Due to space constraints, we do not present detailed graphs for these results.

5.10. Sensitivity to L2 Size and Memory Latency

We evaluate the sensitivity of FDP to different cache sizes and memory latencies. In these experiments, we varied the L2 cache size keeping the memory latency at 500 cycles (baseline) and varied the memory latency keeping the cache size at 1MB (baseline). Table 7 shows the change in average IPC and BPKI provided by FDP over the best performing conventional prefetcher configuration. FDP provides better performance and consumes significantly less bandwidth than the best-performing conventional prefetcher configuration for all evaluated cache sizes and memory latencies. As memory latency increases, the IPC improvement of FDP also increases because the effectiveness of the prefetcher becomes more important when memory becomes a larger performance bottleneck.

5.11. Effect on Other SPEC CPU2000 Benchmarks

Figure 14 shows the IPC and BPKI impact of FDP on the remaining 9 SPEC CPU2000 benchmarks that have less potential. We find that our feedback directed scheme provides 0.4% performance improvement over the best performing conventional prefetcher configuration (i.e. the Middle-of-the-Road configuration) while reducing the bandwidth consumption by 0.2%. None of the benchmarks lose performance with FDP. Note that the best-performing conventional configuration for these 9 benchmarks is not the same as the best-performing conventional configuration for the 17 memory-intensive benchmarks (i.e. the Very Aggressive configuration). Also note that the remaining 9 benchmarks are not bandwidth-intensive except for fma3d and gcc. In gcc, the performance improvement of FDP is 3.0% over the Middle-of-the-Road configuration. The prefetcher pollutes the L2 cache and evicts many useful instruction blocks in gcc, resulting in very long-latency instruction cache misses that leave the processor idle. Using FDP reduces this negative effect by detecting the pollution caused by prefetch references and dynamically reducing the aggressiveness of the prefetcher.

6. Related Work

Even though mechanisms for prefetching have been studied for a long time, dynamic mechanisms to adapt the aggressiveness of the prefetcher have not been studied as extensively as algorithms that decide what to prefetch. We briefly describe previous work in dynamic adaptation of prefetching policies.

6.1. Dynamic Adaptation of Data Prefetching Policies

The work most related to ours in adapting the prefetcher's aggressiveness is Dahlgren et al.'s paper that proposed adaptive sequential (next-line) prefetching [4] for multiprocessors. This mechanism implemented two counters to count the number of sent prefetches (counter-sent) and the number of useful prefetches (counter-used). When counter-sent saturates, counter-used is compared to a static threshold to decide whether to increase or decrease the aggressiveness (i.e. Prefetch Distance) of the prefetcher. While Dahlgren et al.'s mechanism to calculate prefetcher accuracy is conceptually similar to ours, their approach considered only prefetch accuracy to dynamically adapt prefetch distance. Also, their mechanism is designed for a simple sequential prefetching mechanism which prefetches up to 8 cache blocks following each cache miss. In this paper, we provide a generalized feedback-directed approach for dynamically adjusting the aggressiveness of a wide range of state-of-the-art hardware data prefetchers by taking into account not only accuracy but also timeliness and pollution.

14 In the GHB-based prefetching mechanism, Prefetch Distance and Prefetch Degree are the same.
[Figure 13. Effect of FDP on the IPC performance (left) and BPKI memory bandwidth consumption (right) of GHB-based C/DC prefetchers (No prefetching, Very Conservative, Middle-of-the-Road, Very Aggressive).]
L2 Cache Size (memory latency = 500 cycles):
512 KB: ΔIPC 0%, ΔBPKI -13.9% | 1 MB: ΔIPC 6.5%, ΔBPKI -18.7% | 2 MB: ΔIPC 6.3%, ΔBPKI -29.6%
Memory Latency (L2 cache size = 1 MB):
250 cycles: ΔIPC 4.5%, ΔBPKI -23.0% | 500 cycles: ΔIPC 6.5%, ΔBPKI -18.7% | 1000 cycles: ΔIPC 8.4%, ΔBPKI -16.9%
Table 7. Change in IPC and BPKI with FDP when L2 size and memory latency are varied
[Figure 14. IPC performance (left) and memory bandwidth consumption in BPKI (right) impact of FDP on the remaining SPEC benchmarks (No prefetching, Very Conservative, Middle-of-the-Road, Very Aggressive, Dyn. Aggressive. + Dyn. Insertion).]
When the program enters a new phase of execution, the prefetcher is tuned based on the characteristics of the phase in Nesbit et al. [10]. In order to perform phase detection/prediction and identification of the best prefetcher configuration for a given phase, a significant amount of extra hardware is needed. In comparison, our mechanism is simpler because it does not require phase detection or prediction mechanisms.

Recently, Hur and Lin [7] proposed a probabilistic technique that adjusts the aggressiveness of a stream prefetcher based on the estimated spatial locality of the program. Their approach is applicable only to stream prefetchers as it tries to estimate a histogram of the stream length.

6.2. Cache Pollution Filtering

Charney and Puzak [3] proposed filtering L1 cache pollution caused by next-sequential prefetching and shadow directory prefetching from the L2 cache into the L1 cache. Their scheme associates a confirmation bit with each block in the L2 cache which indicates if the block was used by a demand access when it was prefetched into the L1 cache the last time. If the confirmation bit is not set when a prefetch request accesses the L2, the prefetch request is discarded. Extending this scheme to prefetching from main memory to the L2 cache requires a separate structure that maintains information about the blocks evicted from the L2 cache. This significantly increases the hardware cost of their mechanism. Our mechanism does not need to keep history information for evicted L2 cache blocks.

Zhuang and Lee [25] proposed to filter prefetcher-generated cache pollution by using schemes similar to two-level branch predictors. Their mechanism tries to identify whether or not a prefetch will be useful based on past information about the usefulness of the prefetches generated to the same memory address or triggered by the same load instruction. In contrast, our mechanism does not require the collection of fine-grain information on each prefetch address or load address in order to vary the aggressiveness of the prefetcher.

Other approaches for cache pollution filtering include using a profiling mechanism to mark load instructions that can trigger hardware prefetches [23], and using compile-time techniques to mark dead cache locations so that prefetches can be inserted in dead locations [9]. In comparison to these two mechanisms, our mechanism does not require any software or ISA support and can adjust to dynamic program behavior even if it differs from the behavior of the compile-time profile. Lin et al. [15] proposed using density vectors to determine what to prefetch inside a region. This was especially useful in their model as they used very bandwidth-intensive scheduled region prefetching, which prefetches all the cache blocks in a memory region on a cache miss. This approach can be modified and combined with our proposal to further remove the pollution caused by blocks that are not used in a prefetch stream.

Mutlu et al. [17] used the L1 caches as filters to reduce L2
cache pollution caused by useless prefetches. In their scheme, all prefetched blocks are placed into only the L1 cache. A prefetched block is placed into the L2 when it is evicted from the L1 cache only if it was needed by a demand request while it was in L1. In addition to useless prefetches, this approach also filters out some useful but early prefetches that are not used while residing in the L1 cache (such prefetches are common in very aggressive prefetchers). To obtain performance benefit from such prefetches, their scheme can be combined with our cache insertion policy.

6.3. Cache Insertion Policy for Prefetches

Lin et al. [14] evaluated static policies to determine the placement in cache of prefetches generated by a scheduled region prefetcher. Their scheme placed prefetches in the LRU position of the LRU stack. We found that, even though inserting prefetches in the LRU position reduces the cache pollution effects of prefetches on some benchmarks, it also reduces the positive benefits of aggressive stream prefetching on other benchmarks because useful prefetches, if placed in the LRU position, can be easily evicted from the cache in an aggressive prefetching scheme without providing any benefit. Dynamically adjusting the insertion policy of prefetched blocks based on the estimated pollution increases performance by 1.9% over the best static policy (LRU-4) and by 18.8% over inserting prefetches in the LRU position.

7. Conclusion and Future Work

This paper proposed a feedback directed mechanism that dynamically adjusts the behavior of a hardware data prefetcher to improve performance and reduce memory bandwidth consumption. Over previous research in adaptive prefetching, our contributions are:

• We propose a comprehensive and low-cost feedback mechanism that takes into account prefetch accuracy, timeliness, and cache pollution caused by prefetch requests together to both throttle the aggressiveness of the prefetcher and to decide where in the cache to place the prefetched blocks. Previous approaches considered using only prefetch accuracy to determine the aggressiveness of simple sequential (next-line) prefetchers.

• We develop a low-cost mechanism to estimate at run-time the cache pollution caused by hardware prefetching.

• We propose and evaluate using comprehensive feedback mechanisms for state-of-the-art stream prefetchers that are commonly employed by today's high-performance processors. Our feedback-directed mechanism is applicable to any kind of hardware data prefetcher. We show that it works well with stream-based prefetchers, global-history-buffer based prefetchers, and PC-based stride prefetchers. Previous adaptive mechanisms were applicable to only simple sequential prefetchers [4].

Future work can incorporate other important metrics, such as available memory bandwidth, estimates of the contention in the memory system, and prefetch coverage, into the dynamic feedback mechanism to provide further improvement in performance and further reduction in memory bandwidth consumption. The metrics defined and used in this paper could also be used as part of the selection mechanism in a hybrid prefetcher. Finally, the mechanisms proposed in this paper can be easily extended to instruction prefetchers.

Acknowledgments

We thank Matthew Merten, Moinuddin Qureshi, members of the HPS Research Group, and the anonymous reviewers for their comments and suggestions. We gratefully acknowledge the support of the Cockrell Foundation, Intel Corporation and the Advanced Technology Program of the Texas Higher Education Coordinating Board.

References

[1] J. Baer and T. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, 1991.
[2] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422-426, 1970.
[3] M. Charney and T. Puzak. Prefetching and memory system behavior of the SPEC95 benchmark suite. IBM Journal of Research and Development, 41(3):265-286, 1997.
[4] F. Dahlgren, M. Dubois, and P. Stenström. Sequential hardware prefetching in shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 6(7):733-746, 1995.
[5] J. D. Gindele. Buffer block prefetching method. IBM Technical Disclosure Bulletin, 20(2):696-697, July 1977.
[6] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Feb. 2001. Q1 2001 issue.
[7] I. Hur and C. Lin. Memory prefetching using adaptive stream detection. In MICRO-39, 2006.
[8] S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou, and S. G. Abraham. Effective stream-based and execution-based data prefetching. In ICS, 2004.
[9] P. Jain, S. Devadas, and L. Rudolph. Controlling cache pollution in prefetching with software-assisted cache replacement. Technical Report CSG-462, Massachusetts Institute of Technology, 2001.
[10] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. AC/DC: An adaptive data cache prefetcher. In PACT, 2004.
[11] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In ISCA-17, 1990.
[12] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA-8, 1981.
[13] R. L. Lee, P.-C. Yew, and D. H. Lawrie. Data prefetching in shared memory multiprocessors. In ICPP, 1987.
[14] W.-F. Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In HPCA-7, 2001.
[15] W.-F. Lin, S. K. Reinhardt, D. Burger, and T. R. Puzak. Filtering superfluous prefetches using density vectors. In ICCD, 2001.
[16] O. Mutlu, H. Kim, D. N. Armstrong, and Y. N. Patt. An analysis of the performance impact of wrong-path memory references on out-of-order and runahead execution processors. IEEE Transactions on Computers, 54(12):1556-1571, Dec. 2005.
[17] O. Mutlu, H. Kim, D. N. Armstrong, and Y. N. Patt. Using the first-level caches as filters to reduce the pollution caused by speculative memory references. International Journal of Parallel Programming, 33(5):529-559, Oct. 2005.
[18] O. Mutlu, H. Kim, and Y. N. Patt. Techniques for efficient processing in runahead execution engines. In ISCA-32, 2005.
[19] S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement. In ISCA-21, 1994.
[20] J.-K. Peir, S.-C. Lai, S.-L. Lu, J. Stark, and K. Lai. Bloom filtering cache misses for accurate data speculation and prefetching. In ICS, 2002.
[21] A. J. Smith. Cache memories. Computing Surveys, 14(4):473-530, 1982.
[22] L. Spracklen and S. G. Abraham. Chip multithreading: Opportunities and challenges. In HPCA-11, 2005.
[23] V. Srinivasan, G. S. Tyson, and E. S. Davidson. A static filter for reducing prefetch traffic. Technical Report CSE-TR-400-99, University of Michigan, 1999.
[24] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Technical White Paper, Oct. 2001.
[25] X. Zhuang and H.-H. S. Lee. A hardware-based cache pollution filtering mechanism for aggressive prefetches. In ICPP-32, 2003.