2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)

Merging Similar Patterns for Hardware Prefetching



Shizhi Jiang∗, Qiusong Yang†, and Yiwei Ci‡

∗University of Chinese Academy of Sciences
∗†‡Institute of Software, Chinese Academy of Sciences
Beijing, China
Email: ∗[email protected], †[email protected], ‡[email protected]

Abstract—One critical challenge of designing an efficient prefetcher is to strike a balance between performance and hardware overhead. Some state-of-the-art prefetchers achieve very high performance at the price of a very large storage requirement, which makes them not amenable to hardware implementations in commercial processors.

We argue that merging memory access patterns can be a feasible solution to reducing storage overhead while obtaining high performance, although no existing prefetchers, to the best of our knowledge, have succeeded in doing so because of the difficulty of designing an effective merging strategy. After analyzing a large number of patterns, we find that the address offset of the first access in a certain memory region is a good feature for clustering highly similar patterns. Based on this observation, we propose a novel hardware data prefetcher, named Pattern Merging Prefetcher (PMP), which achieves high performance at a low cost. The storage requirement for storing patterns is largely reduced and, at the same time, the prefetch accuracy is guaranteed by merging similar patterns in the training process. In the prefetching process, a strategy based on access frequencies of prefetch candidates is applied to accurately extract prefetch targets from merged patterns. According to the experimental results on a wide range of workloads, PMP outperforms the enhanced Bingo by 2.6% with 30× less storage overhead and Pythia by 8.2% with 6× less storage overhead.

Keywords: cache; hardware data prefetching

I. INTRODUCTION

Prefetching has been one of the well-known techniques for speeding up long-latency memory accesses for decades. With the ever-increasing memory consumption of modern applications, a cache hierarchy with limited capacity can be a bottleneck in the performance improvement of processors [1]. Because caches can be effectively utilized to reduce memory access latency for reusable data, a larger cache might be used to improve performance further. However, larger caches can lead to longer cache access latency. The performance improvement is limited when enlarging the capacity of high-level caches in the hierarchy (e.g., the L1 Data Cache, L1D) due to their strict latency requirements. Prefetchers are often employed to reduce memory access latency by prefetching data in need without requiring a big on-chip area, leading to better cost performance than enlarging caches.

To accurately capture memory access patterns, some state-of-the-art prefetchers require more and more storage for recording memory access history [2]–[8]. The large storage requirement comes from two aspects. First, unlike a simple memory access pattern that can be described by a constant stride, a complex memory access pattern requires varied strides for its description. It can be a considerable cost to record complex patterns through bit vectors [9], delta sequences [10], etc. Second, many features, including Program Counters (PCs), memory addresses, memory address offsets, memory address strides, etc., and their combinations can be leveraged to index patterns for improving prefetch accuracy. A prefetcher aiming for high performance tends to apply fine-grained features with large value ranges to index thousands of patterns, resulting in great storage consumption.

The art of designing an efficient prefetcher that deals with complex memory access patterns is to strike a balance between performance and storage overhead. We find that the major portion of the storage consumption of some prefetchers is caused by severe data redundancy. For instance, we find 82.9% of patterns are redundant in Bingo [2], [11], while astonishingly 24.2% of valid entries are allocated to the same pattern. The low storage efficiency due to the high data redundancy is the main reason for Bingo using a large table with 16,000 entries.

Through analyzing a large number of memory access patterns captured from 125 traces, we observe that the patterns are highly similar if they are indexed by the same address offset of the first access (named Trigger Offset) in a certain memory region. The access is named Trigger Access in the region. Selecting trigger offsets as features offers great promise for designing an efficient pattern merging strategy, because the information loss is relatively small when merging similar patterns. As a result, a prefetcher can consume less storage and obtain high performance if the patterns with the same trigger offset are merged. Because the patterns indexed by a trigger offset are not necessarily identical, the strategy must be deliberately designed such that the characteristics of patterns can be maintained after merging, and prefetch targets can be efficiently extracted from merged patterns.

In this paper, we design a pattern merging strategy that can efficiently quantify the characteristics of patterns. Moreover, we design a strategy that can accurately extract prefetch targets from merged patterns based on access frequencies of prefetch candidates. In a nutshell, we propose a

novel prefetcher, named Pattern Merging Prefetcher (PMP). Through exhaustive experiments on 125 traces from different benchmarks, PMP outperforms the enhanced Bingo by 2.6% with 30× less storage overhead and Pythia [7] by 8.2% with 6× less storage overhead in a single-core system.

We make the following contributions in this paper:
• We make a crucial observation that memory access patterns are highly similar if they have the same trigger offset, after a detailed analysis of a large number of memory access patterns captured from 125 traces.
• We propose a novel prefetcher, named Pattern Merging Prefetcher (PMP), including: a pattern merging strategy that quantifies the characteristics of patterns and reduces the storage consumption; an extraction strategy based on access frequencies of prefetch candidates from merged patterns for accurate prefetching; and an optimization using a dual pattern table structure to provide multi-feature-based pattern prediction.
• We show by extensive experiments over 125 traces that PMP obtains better performance at a low storage cost compared to four state-of-the-art prefetchers.

The remaining sections are organized as follows. Section II provides the background. Section III presents our motivations, i.e., three key observations about memory access patterns. Section IV describes the innovative mechanisms of PMP. An exhaustive performance evaluation is presented in Section V. Section VI discusses related work on hardware data prefetching with different pattern forms. Finally, the paper is concluded in Section VII.

II. BACKGROUND

We first briefly introduce some typical prefetchers based on the bit vector pattern form. Next, we delve into the bit vector pattern capturing framework of Spatial Memory Streaming (SMS) [9], on which our prefetcher is based.

A. Prefetchers Based on Bit Vectors

A bit vector describes the accessed positions in a memory region, each bit of which corresponds to an offset in the region. For example, the bit vector 1011 means that P, P+2, and P+3 have been accessed in memory region P. The bit vector form was first leveraged in SMS, whose efficiency has been proven for addressing complex memory access patterns. Bulk Memory Access Prediction and Streaming (BuMP) [12] improves SMS by reducing memory energy consumption. Bingo [2] improves SMS by using multiple features, combining PCs with offsets or addresses, to accurately locate the patterns to be prefetched. Dual Spatial Pattern Prefetcher (DSPatch) [13] records memory access patterns with dual bit vectors, generated by an OR operation and an AND operation respectively, and uses different prefetch policies according to the bandwidth of the environment.

The bit vector form has many advantages. First, any memory access distribution in a memory region can be represented with a bit vector. Second, bit vectors can represent dozens of offsets in a memory region at a very low cost. Benefiting from the high information density of bit vectors, a prefetcher can immediately generate many prefetch targets, leading to deep prefetching. Though bit vectors do not maintain the temporal information of memory accesses, i.e., the access order, a prefetcher can apply heuristic methods to compensate for this drawback. For example, a prefetcher can prefetch the target nearest to the currently accessed address first. For a prefetcher based on bit vectors, the performance improvement gained by attaching temporal information is not significant [14].

B. Pattern Capturing Framework of SMS

Figure 1: Pattern capturing procedure of SMS.

SMS adopts a lightweight pattern capturing framework, which accounts for 2% of its total storage. It consists of two set-associative tables: the Filter Table (FT) and the Accumulation Table (AT). The FT records information about the first access to each memory region. The AT accumulates the memory access pattern for each memory region. The pattern capturing procedure of SMS is shown in Fig. 1. First, when the memory region of a new memory access misses in both the AT and the FT, the FT allocates a new entry to store the PC and the address of the access. The offset of its address is a trigger offset. Second, when another access to the same region comes with a different offset, a bit vector is assembled with the offsets of these two accesses and sent to the AT.
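Before the remaining steps of the procedure, the bookkeeping performed in the first two steps above can be illustrated with a minimal C sketch. The table organization, sizes, field widths, and names here are illustrative assumptions rather than the exact SMS hardware; the remaining steps (updating the bit vector and evicting it on region eviction) continue after the sketch.

// Minimal sketch of the first two capture steps (FT allocation and
// promotion to the AT); sizes and names are assumptions, not SMS itself.
#include <stdint.h>

#define REGION_BITS 12                      // 4KB spatial regions
#define LINE_BITS    6                      // 64B cache lines
#define NOFF (1 << (REGION_BITS - LINE_BITS))   // 64 offsets per region

typedef struct { int valid; uint64_t region; uint64_t pc; uint8_t trigger; } ft_entry;
typedef struct { int valid; uint64_t region; uint8_t trigger; uint64_t bits; } at_entry;

static ft_entry ft[64];                     // Filter Table
static at_entry at[32];                     // Accumulation Table

void on_l1d_access(uint64_t addr, uint64_t pc) {
    uint64_t region = addr >> REGION_BITS;
    unsigned offset = (unsigned)((addr >> LINE_BITS) & (NOFF - 1));

    for (int i = 0; i < 32; i++)            // region already training: accumulate
        if (at[i].valid && at[i].region == region) { at[i].bits |= 1ULL << offset; return; }

    for (int i = 0; i < 64; i++)            // step 2: second access, different offset
        if (ft[i].valid && ft[i].region == region) {
            if (ft[i].trigger == offset) return;
            at_entry *v = &at[0];           // victim choice (e.g., LRU) omitted
            *v = (at_entry){ 1, region, ft[i].trigger,
                             (1ULL << ft[i].trigger) | (1ULL << offset) };
            ft[i].valid = 0;
            return;
        }

    ft_entry *v = &ft[0];                   // step 1: first access to the region
    *v = (ft_entry){ 1, region, pc, (uint8_t)offset };
}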
Figure 2: Percentages of top 10 frequent patterns.
Figure 3: Pattern collision and pattern duplicate.

Table I: Average Pattern Collision/Duplicate Rates
Feature                 | Pattern Collision Rate | Pattern Duplicate Rate
PC (32b)                | 3823.6                 | 2.2
Trigger Offset (6b)     | 2094.2                 | 2.6
PC+Trigger Offset (38b) | 269.0                  | 6.3
Address (48b)           | 1.8                    | 556.3
PC+Address (80b)        | 1.7                    | 608.7

Third, the offsets of the following accesses that belong to the region are used to update the bit vector in the AT. Finally, the accumulation process of the bit vector finishes when any cached data belonging to this region is evicted. The bit vector is then sent to other components such as a pattern table.

III. MOTIVATION

To analyze the characteristics of memory access patterns, we use the framework of SMS to capture patterns in 125 traces from SPEC CPU 2006 [15], SPEC CPU 2017 [16], PARSEC [17], and Ligra [18]. The FT is 4 × 16 set-associative, the AT is 8 × 16 set-associative, and the pattern length is 64, matching the cachelines in 4KB pages.

Observation 1: Only a tiny minority of memory access patterns occur with high frequency.

Because a bit vector consists of dozens of bits (64b), the huge state space (2^64) might introduce a larger amount of data than any cache can afford. Fortunately, the majority of patterns occur infrequently. According to our experiments, 6.5×10^6 distinct patterns occur about 1.1×10^8 times in total among the 125 traces, and 75.6% of the distinct patterns appear only once. As shown in Fig. 2, the top 10 frequent patterns account for 33.1% of the total occurrences. Note that these patterns cover an extremely small portion (1.55×10^-4 %) of the distinct patterns. Moreover, the top 100 and the top 1000 frequent patterns account for 57.4% and 73.8% of the total occurrences respectively. Because only a tiny minority of patterns occur intensively, it is feasible to design a lightweight prefetcher with dozens of entries, in which only highly frequent patterns are maintained.

Observation 2: Indexing memory access patterns with fine-grained features can lead to severe data redundancy.

The indexing schemes used by state-of-the-art prefetchers [2]–[4], [9], [19] can hardly guarantee that the indexed memory access patterns are unique in their storage. As a result, a large amount of data redundancy might be introduced. To quantitatively analyze the data redundancy of various indexing schemes using different features, we define the Pattern Collision Rate (PCR) as the number of distinct patterns related to a feature value, and the Pattern Duplicate Rate (PDR) as the number of feature values related to a pattern. Fig. 3 shows examples of pattern collisions and pattern duplicates. We say the pattern 1101 has two duplicates because the feature values A and B both index it, and the patterns 1101 and 0101 collide because they are both indexed by the same feature value B. The PDR of 1101 is 2 and the PCR related to B is 2. A greater PDR indicates higher data redundancy, since the same pattern related to different feature values must be stored in multiple entries of a classical set-associative cache. The averages of the two metrics for various features, measured over the 125 traces, are shown in Table I.

Intuitively, prefetchers tend to use fine-grained features with high bit widths, so that reduced PCRs can be obtained to effectively differentiate patterns. According to Table I, the 80-bit PC+Address feature has a PCR of 1.7, providing the highest resolution in recognizing patterns compared to the other listed features. However, large storage can be wasted on storing redundant patterns, because the PDR of the PC+Address feature is high (608.7). When evaluating Bingo, which uses the PC+Address feature, 82.9% of its patterns are redundant at the end of simulation, and 24.2% of valid entries are occupied by the same pattern.

For features containing addresses, their high PDRs indicate that many patterns are shared among different memory regions. To decrease the duplicate patterns caused by this, the high bits of addresses that represent regions cannot be used as a feature, so that the same patterns of different regions can be indexed into the same entry. Because it is difficult to find a feature that can eliminate the duplication, we try to reduce duplication with features of low PDRs.
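The two metrics just defined can be computed offline from a log of (feature value, pattern) observations. The following sketch uses small made-up arrays and hypothetical names purely to illustrate the definitions; it is not part of the PMP hardware.

// Offline sketch of the PCR/PDR metrics; data and names are illustrative.
#include <stdint.h>
#include <stdio.h>

#define NOBS 6
static const uint64_t feat[NOBS] = { 0xA, 0xB, 0xB, 0xA, 0xB, 0xB };
static const uint64_t pat [NOBS] = { 0xD, 0xD, 0x5, 0xD, 0x5, 0x5 };   // 1101 = 0xD, 0101 = 0x5

// PCR(v): number of distinct patterns observed with feature value v.
static int pcr(uint64_t v) {
    int n = 0;
    for (int i = 0; i < NOBS; i++) {
        if (feat[i] != v) continue;
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (feat[j] == v && pat[j] == pat[i]) { seen = 1; break; }
        n += !seen;
    }
    return n;
}

// PDR(p): number of distinct feature values observed with pattern p.
static int pdr(uint64_t p) {
    int n = 0;
    for (int i = 0; i < NOBS; i++) {
        if (pat[i] != p) continue;
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (pat[j] == p && feat[j] == feat[i]) { seen = 1; break; }
        n += !seen;
    }
    return n;
}

int main(void) {
    printf("PCR(B)=%d PDR(1101)=%d\n", pcr(0xB), pdr(0xD));   // prints 2 and 2, matching Fig. 3
    return 0;
}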
Figure 4: Box plot of average ICDDs. White dotted lines refer to means. Points refer to outliers.

Figure 5: Heat maps of patterns. (a) Trigger Offset-indexed pattern heat map for an MCF trace. (b) Trigger Offset-indexed pattern heat map for an Astar trace. (c) PC+Address-indexed pattern heat map for an MCF trace. (d) PC-indexed pattern heat map for an MCF trace. The x-axis represents the accessed offsets (from 0 to 63) in 4KB pages. The y-axis represents the indexes (from 0 to 63) of each feature. A darker point means more patterns containing the offset.

A new problem appears if we attempt to leverage features with limited PDRs, such as the Trigger Offset feature, the PC feature, and the PC+Trigger Offset feature, to reduce data redundancy. Given their high PCRs, memory access patterns would be replaced frequently in a set-associative cache because thousands of distinct patterns inevitably collide with each other, resulting in poor prefetch accuracy. Even if we used an ideally large cache to save all the collided patterns, it would still be difficult to select the correct prefetch targets from them. As a result, those features with small PDRs can hardly be employed in a naive manner to reduce data redundancy.

Observation 3: Patterns are highly similar if they have the same trigger offset.

To reduce overhead and, at the same time, keep the performance of a prefetcher high, a promising approach is to merge feature values or patterns. For example, data redundancy can be eliminated if the feature values related to a pattern are merged into the same index. Challenges exist in implementing the idea of merging feature values or patterns in a prefetcher. On one side, suppose that feature values can be merged by a certain method such as an AND operation. The merged feature value keeps varying during the pattern capturing process, and so does the location of the related pattern in a cache. The requirement of frequently reallocating the locations of patterns makes merging feature values infeasible. On the other side, suppose that the patterns related to a feature value can be merged. Because the feature value is fixed, the location of the merged pattern in a cache is stable. But the merged pattern keeps varying during the pattern capturing process. In this case, how to guarantee prefetch accuracy for a prefetcher based on merged patterns becomes the key problem.

We hope to find a feature that can cluster similar patterns, thereby reducing the difficulty of designing an efficient approach for merging patterns. For example, the common memory access distribution of patterns can be kept after merging with a bit-wise AND operation, and the information loss caused by merging patterns with high similarity is relatively small. In addition, the common distribution of similar patterns can be a good prefetch prediction target. Because memory access patterns are represented as (bit) vectors, we can use the Intracluster Centroid Diameter Distance (ICDD), i.e., twice the average distance between all of the vectors and the center of a cluster, to measure their similarity. A smaller ICDD indicates a higher similarity of the patterns in a cluster. The calculation formulas are:

  ICDD(S) = 2 · ( Σx∈S d(x, V̄) ) / |S|,      V̄ = (1/|S|) Σx∈S x.      (1)

where S is a vector cluster, x is a vector belonging to S, and d(α, β) is the Euclidean distance between two vectors. All the features have the same value range in our analysis. The Trigger Offset, hashed PC, hashed PC+Trigger Offset, hashed Address, and hashed PC+Address features all have a width of 6 bits, i.e., patterns are clustered into 64 sets (clusters). The average ICDD of 64 clusters is used to
We hope to find a feature that can cluster similar patterns, illustrates the layouts of average ICDDs among 125 traces
thereby reducing the difficulty of designing an efficient for each feature. The patterns with the same trigger offset

have the highest similarity. Trigger Offset can therefore be a good feature for implementing an efficient pattern merging method.

To show the clustered patterns straightforwardly, we draw the patterns of several typical traces as heat maps. A point in a heat map represents the magnitude of occurrences of patterns that contain the corresponding offset. Fig. 5a is a heat map that shows the representative memory access distributions of an MCF trace. For most memory accesses in the trace, the program tends to access a few positions around the current access addresses. These frequently visited positions form a blue dotted slash in Fig. 5a. When certain big trigger offsets appear, the program tends to issue backward memory accesses, which form three horizontal dotted lines at the bottom. Fig. 5b shows another memory access distribution, from an Astar trace, in which the patterns can be described through three slashes. This indicates that the memory accesses of Astar obey a constant-stride pattern.

However, the representative patterns cannot be observed when the hashed PC+Address feature is used. Fig. 5c shows the memory access distributions of the MCF trace in this situation. As the figure illustrates, the patterns are scattered into all 64 sets. We cannot tell the common memory access distributions from the heat map. The common characteristics of the memory access patterns are destroyed, probably resulting in inaccurate merging results.

Moreover, because the PDR of the PC feature is the smallest in Table I, one might have thought that it could be a better choice compared to the other features. Surprisingly, we find the similarity of the patterns indexed by PCs is lower than with the Trigger Offset feature according to Fig. 4. For most traces, patterns clustered by the PC feature present overlapped memory access distributions in each set, leading to limited recognition of patterns. In addition, PCs generally distribute patterns into several concentrated sets, which can be leveraged to predict whether or not patterns can be prefetched. As shown in Fig. 5d, those PC-indexed patterns are allocated into several sets, forming horizontal lines. Consequently, the PC feature is used to help predict prefetch levels, as described later in Section IV-C.

Discussion. We think one major reason for Observation 3 is that Trigger Offset is a more general feature relating to the memory access behaviors of programs compared to PCs and addresses. For example, the backward memory accesses, as shown at the bottom of Fig. 5a, can be generated by the two loops of the following code in MCF [20]:

// File: pflowup.c
// Method: MCF_primal_update_flow
// data are stored in a big array
for( ; iplus != w; iplus = iplus->pred )
{ ... }
for( ; jplus != w; jplus = jplus->pred )
{ ... }

Features containing addresses fail to cluster the same patterns because the accesses to a big array can cross many memory regions. The patterns can also hardly be clustered by the PC feature, due to the different PCs of the two loops. In contrast, the trigger offsets can be a general and recognizable feature. First, the backward accesses to a big array will first load data from the end of a region, probably resulting in the same big trigger offset for the same pattern in different regions. Second, the three major patterns generated by the loops can be differentiated by different big trigger offsets, as Fig. 5a shows. They could be mixed if the PCs were used as a feature.

IV. DESIGN: PMP

The prefetching mechanism of PMP consists of two processes working in parallel: the training process and the prefetching process. In this section, we first introduce the Pattern Merging strategy in the training process, enabling a substantial reduction of storage requirements without sacrificing performance. Second, we present a Prefetch Pattern Extraction strategy that generates highly accurate prefetch targets from merged patterns in the prefetching process. Third, we propose an optimized structure, the Dual Pattern Tables, to leverage multiple features for improving prefetch accuracy further. Arbitration rules are also proposed to decide the final prefetch targets based on the predictions of the dual pattern tables. Then, we put it all together and summarize the main flows. Finally, we discuss the overhead.

A. Pattern Merging

The observations in Section III inspire us to build a storage-efficient prefetcher by merging memory access patterns clustered by trigger offsets. How to merge these patterns is a key problem that directly determines the effectiveness of our prefetcher. A bit-wise OR operation or a bit-wise AND operation could be an option for pattern merging, but the two operations are abandoned eventually. The OR operation creates a superset of patterns, e.g., the union of pattern 1111 and any other pattern is always 1111. The AND operation generates a common subset of patterns, e.g., the intersection of pattern 0000 and any other pattern is always 0000. As demonstrated in the examples above, a few outlier samples can obscure the differences in memory access patterns completely, leading to inaccurate records.

Based on extensive studies, we choose to merge bit vector patterns by counting the number of occurrences of each offset in the bit vectors, instead of using the two operations mentioned above. To achieve this goal, we apply a vector of counters (named Counter Vector), in which each element records the number of accesses to an offset, to merge the patterns in a cluster. In the training process of PMP, the counters for offsets in a vector are incremented in parallel if the corresponding offsets are accessed in a bit vector pattern being merged. A sketch of this merging procedure is given below; the worked example of Fig. 6a follows.
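The update just described, anchoring the incoming bit vector to its trigger offset, bumping the corresponding counters, and halving on time-counter saturation, can be sketched in C as follows. This is a minimal sketch under the paper's default parameters (64-offset patterns, 5-bit counters); the function names, the rotate implementation, and the saturation check are illustrative assumptions, not the exact PMP hardware.

// Minimal sketch of the counter-vector merge described above; assumes
// 64-offset patterns and 5-bit counters. Names are illustrative only.
#include <stdint.h>

#define PATTERN_LEN 64
#define CTR_MAX     31                      // 5-bit counters

typedef struct { uint8_t ctr[PATTERN_LEN]; } counter_vec;

void merge_pattern(counter_vec *cv, uint64_t bits, unsigned trigger) {
    // Anchor the bit vector: circular-shift so the trigger offset is element 0.
    uint64_t anchored = trigger ? ((bits >> trigger) | (bits << (PATTERN_LEN - trigger)))
                                : bits;

    for (unsigned i = 0; i < PATTERN_LEN; i++)
        if (((anchored >> i) & 1) && cv->ctr[i] < CTR_MAX)
            cv->ctr[i]++;                   // counters bump in parallel

    // ctr[0] is the Time Counter: it is incremented by every merge. When it
    // saturates, all elements are halved to age old records while keeping
    // part of their weight.
    if (cv->ctr[0] == CTR_MAX)
        for (unsigned i = 0; i < PATTERN_LEN; i++)
            cv->ctr[i] >>= 1;
}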
Figure 6: Architecture of PMP.

Fig. 6a illustrates an example of merging the bit vector (0, 1, 1, 0, 1, 0, 0, 0), captured from the access sequence P+2, P+1, P+4 in memory region P, into the counter vector (3, 0, 3, 0, 3, 0, 0, 0). First, the bit vector needs shifting into an anchored bit vector. The bit vector is converted into (1, 0, 1, 0, 0, 0, 0, 1) by left circular shifting 2 positions because the trigger offset is 2. Then, the first, the third, and the last elements of the counter vector increase by 1 respectively. Finally, the counter vector is updated to (4, 0, 4, 0, 3, 0, 0, 1). Note that the counter corresponding to the trigger offset will always be the first element of a shifted vector and increases in every merging operation, so it is called the Time Counter.

Because all the patterns are merged, old records cannot simply be evicted during the training process. Instead, all the elements in a counter vector are halved when the time counter saturates. This mechanism aims at reducing, but keeping, the effects of old records in the prefetching process. Supposing that the maximum value of time counters is 3, the counter vector (4, 0, 4, 0, 3, 0, 0, 1) in the example above is saturated, so it is halved to (2, 0, 2, 0, 1, 0, 0, 0).

The pattern merging strategy is efficient for two reasons. First, because the patterns to be merged have high similarity, their common characteristics remain after merging, which lays the groundwork for high prediction accuracy in the prefetching process. Otherwise, the merged patterns would be inevitably ambiguous for prefetching, if the patterns being merged were irrelevant. Second, unlike the OR/AND operations, which accept/reject all the differences in patterns, our strategy quantifies the characteristics of patterns and reduces the storage consumption. The statistical results after merging can be leveraged for precisely predicting prefetch targets, which is described in the next subsection.

B. Prefetch Pattern Extraction

In the prefetching process, stored patterns can be triggered when a trigger access comes. Triggered patterns in the form of counter vectors cannot simply be replayed for prefetching like prior prefetchers do. Therefore, a conversion from triggered patterns to prefetch targets is required. Our strategy is to individually select target offsets by examining the corresponding elements in the triggered counter vector to form a Prefetch Pattern, which is a vector of target cache levels for each offset. We consider three different schemes.

Access-Number-based Extraction. A prefetch target for an offset is generated if the corresponding element of the triggered counter vector is equal to or greater than a prefetch threshold. We call this scheme Access-Number-based Extraction (ANE). For example, the counter vector (4, 2, 0, 1) can be converted to the prefetch pattern (0, L1, 0, L1) if the prefetch threshold for L1D is 1; then A+1 and A+3 are the prefetch targets for the current cacheline address A. Please note that the trigger offset (the first offset after shifting) itself will never be prefetched. This scheme is easy to implement, but the obvious drawback is that any target offset needs an inevitable cold start time to reach the prefetch threshold. For a threshold T, an offset will not be prefetched until it has been visited T times, losing many prefetch chances.

Access-Ratio-based Extraction. For every element of the triggered counter vector, a ratio of it to the sum of all counters can be compared to a threshold for generating prefetch targets. We briefly call this scheme Access-Ratio-based Extraction (ARE). For example, given a prefetch threshold 1/4 for L1D, the ratios of each element in the counter vector (4, 2, 0, 1) are (?, 2/3, 0, 1/3). Because the trigger offset is excluded, its ratio is ignored. Then, the counter vector can be converted to the prefetch pattern (0, L1, 0, L1), and A+1 and A+3 are the prefetch targets for the current cacheline address A. Though a counter vector is halved when the time counter saturates, the ratios can hardly be changed.1 The ARE avoids retraining after halving if the pattern of following memory accesses does not vary, so the prefetching process continues. However, the ARE implicitly limits the maximum number of prefetches in one prediction, namely the prefetch depth. Because a prefetch target on an offset is generated only if its corresponding ratio is equal to or greater than a threshold, at most d offsets can be prefetched for the prefetch threshold 1/d. For a triggered vector containing 64 counters with the same value, e.g., a stream pattern, no addresses can be prefetched unless the threshold is lower than 1/63 (excluding the trigger offset). But a low threshold that is smaller than 1/63 may lead to inaccurate prefetching in most cases. The ARE introduces an unnecessary trade-off between the prefetch depth and the prefetch accuracy.

1 The values of counters in a counter vector decrease to half, so does their sum. Therefore, the ratios will not change theoretically. However, the calculation results may be slightly affected because of precision limitations in hardware.
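The two schemes above can be contrasted on the worked example. The following compact sketch applies ANE and ARE to the counter vector (4, 2, 0, 1); the thresholds and function names are illustrative assumptions.

// Sketch contrasting ANE and ARE on the example counter vector (4, 2, 0, 1).
#include <stdio.h>

#define LEN 4

// ANE: select offset i if its raw counter reaches the threshold.
static void ane(const int c[LEN], int threshold, int out[LEN]) {
    for (int i = 1; i < LEN; i++)           // offset 0 is the trigger: never prefetched
        out[i] = (c[i] >= threshold);
}

// ARE: select offset i if its share of all non-trigger counters reaches the threshold.
static void are(const int c[LEN], double threshold, int out[LEN]) {
    int sum = 0;
    for (int i = 1; i < LEN; i++) sum += c[i];
    for (int i = 1; i < LEN; i++)
        out[i] = (sum > 0 && (double)c[i] / sum >= threshold);
}

int main(void) {
    int cv[LEN] = {4, 2, 0, 1}, a[LEN] = {0}, r[LEN] = {0};
    ane(cv, 1, a);                          // selects offsets 1 and 3
    are(cv, 0.25, r);                       // selects offsets 1 and 3 (ratios 2/3 and 1/3)
    printf("ANE: %d%d%d  ARE: %d%d%d\n", a[1], a[2], a[3], r[1], r[2], r[3]);
    return 0;
}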
Access-Frequency-based Extraction. The access frequencies of offsets can be a good criterion for prefetch target selection. An access frequency differs from an access ratio in that the former indicates how many times an offset occurs in a period, and it is not influenced by the occurrences of other offsets. Moreover, the access frequency of an offset can be easily calculated by dividing its counter by the time counter. The Access-Frequency-based Extraction (AFE) generates prefetch targets by comparing the access frequencies of offsets to a prefetch threshold. For example, given an L1D prefetch threshold of 1/4, the frequencies of the counters in the counter vector (4, 2, 0, 1) are (?, 2/4, 0, 1/4), so the counter vector can be converted to the prefetch pattern (0, L1, 0, L1); then A+1 and A+3 are the prefetch targets for the current cacheline address A. This scheme has several advantages. First, because the halving mechanism hardly affects the frequencies of offsets, the AFE avoids the extra retraining process when the following memory access patterns do not change. More importantly, the AFE has better adaptability to different kinds of patterns compared to the former two schemes. On one side, the AFE does not have the cold start problem. If an offset appears every time in the last T patterns, its frequency is 100% from the beginning of training, which exceeds any threshold. This is friendly to patterns with few repetitions. On the other side, no implicit restrictions are introduced by the AFE, since every offset that frequently occurs can be independently selected. This is friendly to stream-like patterns. Finally, we use the AFE as the default prefetch pattern extraction scheme.

To reduce the cache pollution in high-level caches without losing prefetch chances, PMP prefetches data into different cache levels depending on various thresholds. In the default configuration using the AFE, the confidence refers to frequencies. The threshold Tl1d for prefetching data into L1D is 50% and the threshold Tl2c for the L2 Cache (L2C) is 15%. The target addresses, assembled using the current cacheline address and the offsets with confidence greater than or equal to Tl1d, are prefetched to L1D. The targets with confidence greater than or equal to Tl2c but less than Tl1d are prefetched into L2C to reduce the risk of cache pollution in L1D. Fig. 6b shows an example in which the prefetch pattern (0, 0, L1, 0, L1, 0, 0, L2) is extracted from the counter vector (4, 0, 4, 0, 3, 0, 0, 1).

As shown at the bottom of Fig. 6c, a new prefetch pattern is stored in the Prefetch Buffer (PB) and indexed by the region address of the trigger access. There is no fixed prefetch degree for PMP. When free Prefetch Queue (PQ) entries exist, PMP first assembles addresses using the valid offsets in the prefetch pattern that are near the cacheline address of the trigger access and issues them to the corresponding cache levels. If the PQ is full, the prefetching process is suspended. When any load with an address in the same region reappears and free PQ entries exist, the process continues with the prefetch pattern in the PB. Please note that prefetch requests are prohibited from occupying all MSHRs; at least one MSHR is reserved for normal load/store requests.

C. Multi-Feature-based Prediction

Based on Observation 3, we notice that the PC feature of the trigger access can also be leveraged to help predict prefetch targets. The combination of PCs and trigger offsets can be used as a feature like prior research does, but it may not be the best option. First, the concatenated feature expands the index range greatly, necessitating a large table to store merged patterns. Second, patterns clustered by the combination of PCs and trigger offsets have lower similarity. The memory access patterns indexed by trigger offsets are separated into different sets again by PCs, bringing a greater divergence of patterns in a set than using the Trigger Offset feature or the PC feature alone.

Dual Pattern Tables. To enable multi-feature-based prediction and reduce the divergence of memory access patterns in a cluster, dual pattern tables are applied to maintain merged patterns indexed by trigger offsets and PCs respectively, as Fig. 6c illustrates. Because counter vectors collect all the patterns indexed by the same feature value without eviction, the vectors can be stored in tagless direct-mapped tables. The Offset Pattern Table (OPT), indexed with trigger offsets, is the primary pattern table, and the PC Pattern Table (PPT), indexed with PCs, is the supplementary pattern table. During the training process, the dual pattern tables are updated simultaneously. In the prefetching process, candidate prefetch patterns are independently predicted by the two tables when a trigger access comes. Two candidate prefetch patterns are given by the tables using the AFE described in the last subsection.
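Putting Sections IV-B and IV-C together, the following sketch shows how a triggered OPT counter vector could be turned into a vector of target cache levels using the AFE and the default thresholds (Tl1d = 50%, Tl2c = 15%). The table layout, enum, and names are illustrative assumptions, not the exact PMP hardware; the arbitration with the PPT is described next.

// Minimal sketch of AFE-based prefetch pattern extraction with the
// default thresholds; layout and names are illustrative assumptions.
#include <stdint.h>

#define PATTERN_LEN 64
enum level { NO_PF = 0, TO_L1D, TO_L2C, TO_LLC };

typedef struct { uint8_t ctr[PATTERN_LEN]; } counter_vec;
static counter_vec opt[64];                 // Offset Pattern Table, indexed by trigger offset

void extract_afe(unsigned trigger_offset, uint8_t pattern[PATTERN_LEN]) {
    const counter_vec *cv = &opt[trigger_offset];
    unsigned time_ctr = cv->ctr[0];         // element 0 is the Time Counter

    pattern[0] = NO_PF;                     // the trigger offset itself is never prefetched
    for (unsigned i = 1; i < PATTERN_LEN; i++) {
        unsigned freq100 = time_ctr ? (100u * cv->ctr[i]) / time_ctr : 0;
        if      (freq100 >= 50) pattern[i] = TO_L1D;   // Tl1d
        else if (freq100 >= 15) pattern[i] = TO_L2C;   // Tl2c
        else                    pattern[i] = NO_PF;
    }
}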
Arbitration. An arbiter is applied to decide the final prefetch pattern depending on the predictions from the two tables. Though both pattern tables can give prefetch targets individually, it is better to discard the targets given by the PPT that are not included in the targets given by the OPT. The predictions of the OPT are more accurate than those of the PPT, according to the experimental results in Section V-E3. As shown in Fig. 6e, the arbitration rules are as follows: 1) a prefetch aiming at a target offset is issued to L1D only if both pattern tables predict prefetching that offset into L1D; 2) if the same target offset is predicted by both tables and one of them predicts prefetching into L2C, the target is prefetched into L2C; 3) if the PPT has no predictions, the cache level of the prefetches predicted by the OPT is downgraded (e.g., L2C to the Last Level Cache, LLC); 4) no prefetches are issued if there are no predictions from the OPT. Please note that the final prefetch pattern after arbitration is stored in the PB.

Coarse Counter Vector. Because the predictions from the PPT only affect prefetch cache levels, the storage cost can be reduced further by monitoring several offsets with one counter. The number of offsets monitored by a counter is called the Monitoring Range. As Fig. 6d depicts, the union of every few adjacent bits of a bit vector is counted by a shorter counter vector, named the Coarse Counter Vector. The 8-bit vector 10100001 is reduced to 1101 by joining every two bits; 1101 is then merged into the coarse counter vector (3, 1, 0, 1). Every element in a coarse counter vector controls whether/where to prefetch on the adjacent offsets at the prefetch level arbitration step. The final prefetch pattern in the example of Fig. 6 is (0, 0, L1, 0, L2, 0, 0, L2), based on the two candidate prefetch patterns: (0, 0, L1, 0, L1, 0, 0, L2) from the OPT and (0, L1, 0, L2) from the PPT.

D. Putting It Together

Figure 7: Training and prefetching flows of PMP.

Fig. 7 shows the training and the prefetching flows of PMP. The training process operates on L1D loads. If the region of an L1D load misses in the AT and the FT, it is a trigger access in the region. The pattern capturing framework records the following accesses in the region as described in Section II-B, and the captured pattern is merged into the OPT and the PPT. The prefetching process is performed at the same time, when the trigger access comes. The offset and the PC of the trigger access are used to trigger (index) the OPT and the PPT respectively. The final prefetch pattern is obtained after the extraction and the arbitration, and it is then stored in the PB for prefetching.

E. Overhead

Table II lists all the preset parameters of PMP, and Table III lists the overhead details.

Table II: Preset Parameters
Parameter                      | Value
OPT Counter Size               | 5b
PPT Counter Size               | 5b
OPT Pattern Length             | 64
PPT Pattern Length             | 32
Region Size                    | 4KB
Monitoring Range               | 2
L1D Prefetch Threshold (Tl1d)  | 50%
L2C Prefetch Threshold (Tl2c)  | 15%

Table III: Detailed Storage Overhead
Structure            | Entries | Entry Fields                                                                   | Storage
Filter Table         | 8 × 8   | Region Tag (33b), Hashed PC (5b), Trigger Offset (6b), LRU (3b)                | 376 Bytes
Accumulation Table   | 2 × 16  | Region Tag (35b), Hashed PC (5b), Trigger Offset (6b), Bit Vector (64b), LRU (4b) | 456 Bytes
Offset Pattern Table | 64 × 1  | Counter Vector (320b)                                                          | 2560 Bytes
PC Pattern Table     | 32 × 1  | Coarse Counter Vector (160b)                                                   | 640 Bytes
Prefetch Buffer      | 1 × 16  | Region Tag (36b), Prefetch Pattern (126b), LRU (4b)                            | 332 Bytes
Total                |         |                                                                                | ≈ 4.3KB

In our default configuration, the OPT indexed with trigger offsets contains 64 entries, and the PPT indexed with PCs has 32 entries. The FT and the AT have 64 and 32 entries respectively. The PB can store 16 prefetch patterns. Two bits are enough for the four states of every offset: No Prefetch, Prefetch to L1D, Prefetch to L2C, and Prefetch to LLC. Therefore, a prefetch pattern requires 126 bits for 63 targets. The size of the memory region that each pattern corresponds to is 4KB. Finally, the total hardware overhead of PMP is 4.3KB (30× lower than Bingo).

We use CACTI [21] with its 22nm configuration to estimate the area consumption and the cache access time of our design. The area of the dual pattern table structure is 0.0069 mm². It is 151× smaller than the large set-associative pattern table in Bingo, which costs 1.0372 mm². The total access time (input and output) of the dual pattern table structure is 0.1ns, which is 11× shorter than the access time of Bingo's pattern table.
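Before moving to the evaluation, the Coarse Counter Vector introduced in Section IV-C can be sketched as follows. The sketch assumes the default monitoring range of 2 and 5-bit counters; names are illustrative assumptions, not the exact PPT hardware.

// Minimal sketch of the Coarse Counter Vector: adjacent bits of the
// anchored bit vector are OR-joined (Monitoring Range = 2) before counting.
#include <stdint.h>

#define PATTERN_LEN 64
#define MON_RANGE    2
#define COARSE_LEN  (PATTERN_LEN / MON_RANGE)   // 32 coarse counters
#define CTR_MAX     31                          // 5-bit counters

void merge_coarse(uint8_t coarse[COARSE_LEN], uint64_t anchored_bits) {
    for (unsigned i = 0; i < COARSE_LEN; i++) {
        unsigned joined = 0;                    // OR of MON_RANGE adjacent bits
        for (unsigned j = 0; j < MON_RANGE; j++)
            joined |= (unsigned)((anchored_bits >> (i * MON_RANGE + j)) & 1);
        if (joined && coarse[i] < CTR_MAX)
            coarse[i]++;
    }
}
// Example from the text: the 8-bit vector 10100001 joins to 1101, which is
// then accumulated into the coarse counter vector (3, 1, 0, 1).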
V. EVALUATION

A. Methodology

1) Configuration: We use the ChampSim [22] simulator to evaluate the prefetchers. Table IV illustrates the simulated system configuration. We compare PMP to four state-of-the-art prefetchers: DSPatch [13], Bingo [2], [11], SPP+PPF [4], [23], and Pythia [7]. For a fair comparison, all prefetchers are placed at L1D, and no helper prefetchers exist at the other cache levels; that is, the five prefetchers are all single-level in our evaluation. DSPatch is a lightweight bit-vector-based prefetcher that records patterns through an OR and an AND operation. The competition version of Bingo in the third Data Prefetching Championship (DPC) [11] is the most powerful single-level prefetcher according to our experiments with a couple of state-of-the-art prefetchers. We enhance it by doubling the size of its pattern table to match its original configuration [2]. SPP+PPF is a strong competitor in DPC-3 [23], leveraging nine different features. Pythia is a new type of prefetcher built on machine learning in hardware. Table V shows the storage overhead of these prefetchers.

Table IV: Simulated System Configuration
Core  | One to four cores, 4GHz, 4-wide, 352-entry ROB, 128-entry LQ, 72-entry SQ, 4KB pages
TLBs  | 64-entry ITLB, 64-entry DTLB, 1536-entry L2DTLB
L1I   | 32KB, 8-way, 32-entry PQ, 8-entry MSHR, 4 cycles
L1D   | 48KB, 12-way, 8-entry PQ, 16-entry MSHR, 5 cycles
L2C   | 512KB, 8-way, 16-entry PQ, 32-entry MSHR, 10 cycles
LLC   | 2MB to 8MB, 16-way, 32- to 128-entry PQ, 64- to 256-entry MSHR, 20 cycles, inclusive
DRAM  | 4GB, 1 channel (1-core); 8GB, 2 channels (4-core); 3200 MT/s

Table V: Prefetcher Overhead
DSPatch | Bingo   | SPP+PPF | Pythia | PMP
3.6KB   | 127.8KB | 48.4KB  | 25.5KB | 4.3KB

2) Benchmarks: Over one hundred traces, captured from SPEC CPU 2006 [15], SPEC CPU 2017 [16], PARSEC [17], and Ligra [18], are used to evaluate the single-core performance of the five prefetchers. For SPEC CPU 2006 and SPEC CPU 2017, we use the instruction traces provided by DPC-2 and DPC-3 [24], [25]. For PARSEC and Ligra, we use the traces provided by Pythia [7]. All the traces have more than five LLC misses per kilo instructions (MPKI). Table VI lists the number of traces from each benchmark.

Table VI: Traces
SPEC 06 | SPEC 17 | Ligra | PARSEC | Total
38      | 36      | 42    | 9      | 125

We evaluate the multi-core performance of the five prefetchers with homogeneous and heterogeneous multi-programmed workloads on a 4-core processor. For homogeneous workloads, we examine the prefetchers with the 125 traces, each of which is simultaneously executed by the different cores. For heterogeneous workloads, we first classify the traces into three classes: Low MPKI (5 < MPKI ≤ 10), Medium MPKI (10 < MPKI ≤ 20), and High MPKI (MPKI > 20). Then, we randomize traces according to their classes and generate workloads as Table VII shows. We also examined the prefetchers with CloudSuite traces [26] but did not see significant performance improvements, so the results are omitted.

Table VII: Heterogeneous 4-core Workloads
MIX                            | #
All Low MPKI                   | 10
All Medium MPKI                | 10
All High MPKI                  | 10
Half Low and Half Medium MPKI  | 10
Half Low and Half High MPKI    | 10
Half Medium and Half High MPKI | 10

In both single-core and multi-core systems, we use the first 50 million instructions of a trace to warm up the microarchitectural structures and the next 200 million instructions to examine the performance of each core.

B. Single-core Performance

Fig. 8 illustrates the normalized IPCs (NIPCs) of the five prefetchers in the single-core system. We observe that PMP outperforms the other four prefetchers at a very low storage cost. PMP improves the performance of the non-prefetching baseline by 65.2% and outperforms DSPatch, Bingo, SPP+PPF and Pythia by 41.3% (up to 177.4%), 2.6% (up to 62.4%), 6.5% (up to 59.2%) and 8.2% (up to 183.1%) respectively. DSPatch requires the lowest storage among the five prefetchers. However, the low performance of DSPatch indicates that its pattern capturing strategies, an OR and an AND operation, are inefficient. Bingo is a heavy prefetcher that is 3× larger than a typical L1D, so it is more realistic for it to be placed at the low-level caches, which brings lower performance: PMP (at L1) outperforms the original Bingo at the LLC by 16.5%. SPP+PPF leverages nine features for filtering the predictions generated by an aggressive Signature Path Prefetcher (SPP) [10], which performs much better than the original SPP. Though SPP+PPF applies more features compared to Bingo and PMP, its performance is lower, which indicates that more features are not always better. In contrast to PMP, which can issue dozens of prefetches in one prediction, Pythia only generates one prefetch target per prediction, which limits its prefetch depth and performance. Pythia cannot prefetch deeply due to its poor prefetch accuracy, which is described in the next subsection.

On desktop or scientific workloads such as SPEC CPU 2006 and SPEC CPU 2017, the performance of PMP is better than on the other workloads, since the memory access patterns of these workloads are regular. Though prefetchers usually require large hardware storage to deal with the irregular memory accesses generated by Ligra and PARSEC, PMP still outperforms the heavy prefetchers (Bingo, SPP+PPF and Pythia) at a lower cost on these traces.

C. Coverage & Accuracy in Single-core

Prefetch coverage and prefetch accuracy are two metrics that explain a lot about the performance of prefetchers. We define the coverage as the ratio of reduced load misses to the total load misses of the baseline, and the accuracy as the ratio of useful prefetches to the sum of useful and useless prefetches.
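The two metric definitions can be read directly off simulator counters, as in the small sketch below; the structure fields, names, and the example numbers are illustrative assumptions.

// Small sketch of the coverage and accuracy metrics defined above.
#include <stdio.h>

struct pf_stats {
    unsigned long baseline_load_misses;   // load misses without prefetching
    unsigned long load_misses;            // load misses with the prefetcher on
    unsigned long useful_prefetches;      // prefetched lines later demanded
    unsigned long useless_prefetches;     // prefetched lines evicted unused
};

static double coverage(const struct pf_stats *s) {
    return (double)(s->baseline_load_misses - s->load_misses) / s->baseline_load_misses;
}

static double accuracy(const struct pf_stats *s) {
    return (double)s->useful_prefetches / (s->useful_prefetches + s->useless_prefetches);
}

int main(void) {
    struct pf_stats s = { 1000000, 400000, 700000, 300000 };   // made-up numbers
    printf("coverage=%.2f accuracy=%.2f\n", coverage(&s), accuracy(&s));
    return 0;
}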
Figure 8: Single-core performance of five prefetchers.
Figure 9: Coverage and accuracy of five prefetchers.
Figure 10: Average useful and useless prefetches.

PMP, SPP+PPF, and Bingo prefetch data into multiple levels of caches directly: PMP can fill data into L1D, L2C, and LLC, while SPP+PPF and Bingo can fill data into L1D and L2C based on their configuration. We therefore examine coverage and accuracy at the different cache levels separately.

Fig. 9 shows the coverage and the accuracy of these prefetchers. We observe that PMP obtains the highest L2C coverage and the highest LLC coverage among the five prefetchers. Its L1D coverage is higher than DSPatch by 121.1%, and is competitive compared to the other prefetchers. The L1D accuracy of PMP exceeds DSPatch, SPP+PPF and Pythia by 52.8%, 11.4% and 29.2%. The L2C accuracy of PMP exceeds DSPatch, Bingo, SPP+PPF, and Pythia by 118.1%, 66.8%, 302.3%, and 243.7%. The high accuracy indicates that the prediction strategy of PMP is accurate. Since the Trigger Offset and PC features allow patterns to be shared between different memory regions, prefetched data can be demanded even if it has never been accessed before in a certain memory region, reducing compulsory misses. As a result, the coverage of PMP is also high. In a nutshell, the high coverage and the high accuracy contribute much to the high performance of PMP.

The results show that the L2C accuracy and the LLC accuracy are much lower than the L1D accuracy for all five prefetchers. This is because all the prefetchers train on L1D accesses. In the case of inclusive caches, prefetches for high-level caches will implicitly prefetch data into the low-level caches to keep inclusiveness. Since many load/store requests are invisible to the low-level caches, the corresponding prefetch accuracy is lower.

Fig. 10 shows the average valid prefetches, broken down into useful and useless prefetches at the different cache levels. We observe that PMP obtains high performance by suppressing the cache pollution in L1D and prefetching speculatively for the low-level caches. First, PMP effectively restricts useless prefetches in the storage-limited L1D while still offering many useful prefetches. DSPatch, SPP+PPF, and Pythia fail to limit the cache pollution in L1D compared to PMP. Though SPP+PPF and Pythia generate more useful prefetches than PMP, their large numbers of useless L1D prefetches hurt performance. Bingo issues many useful prefetches while producing the least amount of useless prefetches in L1D, benefiting from applying the fine-grained feature (PC+Address) with the greatest pattern differentiation ability among the five prefetchers, as Table I shows. However, it also requires the largest storage space. Second, because PMP has a low L2C threshold (15%) and its first arbitration rule for prefetching data into L1D is strict, it speculatively prefetches data into the low-level caches. Benefiting from the accurate prefetching strategy, the numbers of useful L2C and LLC prefetches of PMP are much larger than those of the other four prefetchers. Since L2C and LLC have larger storage space and are not latency-sensitive compared to L1D, it is affordable for them to contain more useless prefetches. The useless L2C/LLC prefetches of PMP do not hurt performance much.

D. Memory Traffic in Single-core

Prefetchers generally generate many memory access requests that consume additional memory bandwidth. To evaluate the memory traffic of the five prefetchers, we calculate the ratio of total memory access requests to the requests of the non-prefetching baseline as the Normalized Memory Traffic (NMT). Through evaluation, the NMTs of SPP+PPF and Pythia are 129.0% and 139.1%. Because triggered patterns in the bit vector form can generate dozens of prefetch targets in every prediction, the NMTs of DSPatch and Bingo are higher than those of the former two prefetchers, at 159.8% and 164.2% respectively.

PMP produces the highest NMT (199.6%). This might be because PMP has the most aggressive prefetch policy among the five prefetchers. First, counter vectors can generate up to 63 prefetch targets in one prediction, and prefetches will be issued whenever free PQ entries exist. Second, the prefetch condition of PMP is easy to meet because there are no tags to match. After training PMP for a brief time, each trigger offset can issue a bundle of prefetches. PMP issues 1.46×10^7 prefetches per trace, which is 58.0% more than Bingo. This aggressive prefetch policy provides deeper speculation through prefetching. Benefiting from the accurate prediction strategy, more useful prefetches are generated, so PMP obtains higher L2C/LLC prefetch coverage compared to the other four prefetchers at the price of more memory traffic. Considering the high memory bandwidth consumption, high-frequency memory (DDR4 [27], DDR5 [28], HBM [29], etc.) is appealing for PMP.
Table VIII: Performance of Design B
Ways | 8     | 32    | 128   | 512
NIPC | 1.176 | 1.188 | 1.215 | 1.224

Table IX: Performance and Overhead of PMP Under Different Pattern Lengths
Pattern Length | Region Size | Overhead | NIPC
64             | 4KB         | 4.3KB    | 1.652
32             | 2KB         | 2.5KB    | 1.626
16             | 1KB         | 1.6KB    | 1.572

Table X: Performance of PMP Under Different Trigger Offset Widths and Counter Sizes
Trigger Offset Width (b) | NIPC  | Counter Size (b) | NIPC
6                        | 1.652 | 2                | 1.624
7                        | 1.654 | 3                | 1.634
8                        | 1.655 | 4                | 1.648
9                        | 1.657 | 5                | 1.652
10                       | 1.657 | 6                | 1.653
11                       | 1.657 | 7                | 1.654
12                       | 1.658 | 8                | 1.655

Figure 11: Main structure of Design B.

To limit the NMT of PMP without introducing extra overhead, a straightforward approach is to control the aggressiveness of the prefetches for the low-level caches (L2C and LLC). When the prefetch degree for the low-level caches is 1, PMP still outperforms Bingo by 1.3% with a lower NMT (159.0%).

E. Performance Analysis in Single-core

1) Pattern Merging Strategy: In Section V-B, the comparison between PMP and DSPatch indicated that merging patterns by an OR or an AND operation can be inefficient. Other than the merging strategy of PMP, it could be possible to obtain low storage consumption and good performance by merging only identical patterns, considering the spatial and temporal locality of workloads. A bit vector attached with a counter can be used to count the repetitions of identical patterns, in which case the counter can be used for target pattern selection, e.g., the ANE scheme can be used to determine the prefetch pattern according to the counter. Instead of individually selecting each offset, all the valid offsets in a bit vector whose counter exceeds the threshold are the prefetch targets. The enhanced bit vectors can be stored in a set-associative cache indexed with trigger offsets. This design is named Design B, and its main structure is illustrated in Fig. 11. We compare Design B to the pattern merging strategy of PMP. Table VIII shows that the performance of Design B grows as the associativity increases. However, PMP outperforms Design B with 512 ways by 34.9%. Our pattern merging strategy is more efficient and eliminates the impact of a large number of evictions. Bingo does not suffer from severe evictions because it uses features with low PCRs.

2) Prefetch Pattern Extraction Schemes: We consider three schemes for the prefetch pattern extraction in Section IV-B. For the ANE, Tl1d is 16 and Tl2c is 5, to scale close to the default thresholds of the AFE. The thresholds are the same between the ARE and the AFE. Compared to the AFE, the ARE performs poorly: it improves over the baseline by only 5.0%. We find that the poor adaptability of the ARE for prefetching severely limits the prefetch coverage. Because stream patterns can hardly be extracted by the ARE, a large number of prefetch opportunities are lost. The performance of the ARE is very low on many traces that contain a large number of stream patterns. Furthermore, PMP obtains no performance improvements by tuning the thresholds of the ARE. Though methods similar to the ARE work well in many prefetchers [4], [10], [30], the ARE is not suitable for our prefetcher.

Because the ANE has the cold start problem, and the prefetching process can be interrupted by the halving mechanism, the performance of the ANE is slightly impacted. PMP using the ANE achieves a 60.3% improvement over the baseline, which is 2.9% lower than the AFE. The ANE can be another feasible scheme, with reduced hardware complexity compared to the AFE.

3) Multi-Feature-based Prediction: We compare the performance of the dual pattern table structure to a single table indexed with the combination of PCs and trigger offsets. Because PMP uses 6-bit trigger offsets and 5-bit PCs, the number of entries storing patterns increases from 96 (2^6 + 2^5) to 2048 (2^(6+5)) when the combined feature is used. However, the performance degrades by 3.1% compared to the dual pattern table structure.

We also evaluate the performance of PMP with a single feature. The performance reduces by 2.4% when using a single OPT. Moreover, we attempt to use a single PPT with the same size as the OPT, and the performance reduces by 3.5%. We find that the single-feature-based PMP prefetches more data into the higher-level caches than the multi-feature-based PMP because of the lack of prefetch level arbitration. As a result, more useless prefetches are produced. For example, PMP with the single PPT generates 38.3% more useless prefetches at L1D compared to the default PMP. The large number of useless prefetches at L1D hurts its performance.

4) Preset Parameters: Pattern Length. The length of counter vectors depends on the size of the tracked memory regions.
Table XI: Performance of PMP Under Different Pattern
Monitoring Ranges
Monitoring
1 2 4 8
Range
NIPC 1.650 1.652 1.630 1.615

We compare the performance of the dual pattern table structure to the single table indexed with the combination of PCs and trigger offsets. Because PMP uses 6-bit trigger offsets and 5-bit PCs, the number of entries storing patterns increases from 96 (2^6 + 2^5) to 2048 (2^(6+5)) when the combined feature is used. However, the performance degrades by 3.1% compared to the dual pattern table structure.

We also evaluate the performance of PMP with a single feature. The performance reduces by 2.4% when using a single OPT. Moreover, we attempt to use a single PPT with the same size as the OPT, and the performance reduces by 3.5%. We find that the single-feature-based PMP prefetches more data into higher-level caches than the multi-feature-based PMP because it lacks the prefetch level arbitration. As a result, more useless prefetches are produced. For example, PMP with the single PPT generates 38.3% more useless prefetches at L1D compared to the default PMP. The large number of useless prefetches at L1D hurts its performance.

4) Preset Parameters: Pattern Length. The length of counter vectors depends on the size of tracked memory regions. Given a 4KB memory region, the accesses to each cacheline inside the region can be tracked using a vector containing 64 counters. PMP does not support cross-page prefetching, so we only discuss memory regions smaller than 4KB pages. As Table IX shows, we evaluate the performance of PMP with different lengths of patterns at different scales (i.e., in 4KB, 2KB, and 1KB regions), namely PMP-64, PMP-32, and PMP-16. The results show that the performance degrades as the pattern length becomes shorter. Patterns can be folded when the tracked memory regions shrink, which distorts the statistical results of pattern merging, so the accuracy of extracting the prefetch patterns is lower. According to our analysis, the L2C prefetch accuracy of PMP-32 decreases from 15.2% to 10.6% compared to PMP-64, and the L2C accuracy decreases to 8.0% after reducing the pattern length to 16. Though the performance decreases, the overall overhead is greatly reduced when using short patterns. PMP-16 is still competitive in terms of performance improvements, as it outperforms DSPatch by 34.5% at a 2x lower storage cost. PMP-32 outperforms Bingo by 1.0% at a 51x lower storage cost. We attempt to adjust the size of tracked memory regions of Bingo from 2KB to 4KB, but this adjustment does not affect the accuracy of Bingo because its prefetch accuracy relies on the fine-grained features.
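The mapping from an access to a counter in these vectors can be sketched as follows, assuming 64-byte cachelines and a 4KB (PMP-64) region; the surrounding training-table bookkeeping is omitted, and the saturating-counter width is our assumption for illustration.

#include <array>
#include <cstdint>

constexpr int kLineBits   = 6;              // 64B cachelines (assumed)
constexpr int kRegionBits = 12;             // 4KB regions, i.e., PMP-64
constexpr int kPatternLen = 1 << (kRegionBits - kLineBits);  // 64 counters

struct RegionPattern {
    uint64_t region = 0;                          // region base (page-aligned)
    std::array<uint8_t, kPatternLen> counters{};  // one counter per cacheline offset
};

// Record one demand access into the counter vector of its region.
void record_access(RegionPattern& p, uint64_t paddr) {
    unsigned offset = (paddr >> kLineBits) & (kPatternLen - 1);  // cacheline offset in region
    if (p.counters[offset] != UINT8_MAX) ++p.counters[offset];   // saturating update
}

With 2KB or 1KB regions, kPatternLen shrinks to 32 or 16, which corresponds to the PMP-32 and PMP-16 configurations discussed above.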
Trigger Offset Width. Table X shows the NIPCs of PMP under different trigger offset widths. As the results show, the performance increases with the width of trigger offsets. To avoid hash collisions, the sizes of the direct-mapped tables are equal to the value ranges of the features. In this case, the storage overhead increases exponentially with the width of trigger offsets. Compared with the default PMP, the overhead increases by 64x when using 12-bit trigger offsets, while the NIPC increases by only 0.4%. Since the gain in performance is negligible, we finally choose to use 6-bit trigger offsets.
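The exponential growth follows directly from sizing a direct-mapped table by the value range of its index; the short check below is our own arithmetic and matches the 64x figure quoted above.

#include <cstdint>

// Entries of a direct-mapped table indexed by a w-bit trigger offset.
constexpr uint64_t entries(unsigned width) { return 1ull << width; }

static_assert(entries(6)  == 64,   "default PMP: 6-bit trigger offsets");
static_assert(entries(12) == 4096, "12-bit trigger offsets");
static_assert(entries(12) / entries(6) == 64, "hence the 64x table growth");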
Counter Size. The counter sizes of the OPT and the PPT can be different. We set the counter size of the PPT to 5 bits and only modify the counter size of the OPT. As described in Section IV-A, by leveraging the halving mechanism to reduce the effects of old records, a larger counter size allows maintaining history for a longer period. Table X illustrates the performance of PMP under different counter sizes. With the increase of counter sizes, the performance can increase accordingly. The results show that the extraction strategy can make better predictions with longer history.
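The halving mechanism referred to above can be sketched as follows: when a counter is about to overflow, every counter in the vector is halved, so recent behavior dominates while old records decay. The overflow-triggered policy shown here is our assumption for illustration; Section IV-A gives the exact rule used by PMP.

#include <array>
#include <cstdint>

template <unsigned Bits, size_t N>
void bump_with_halving(std::array<uint16_t, N>& counters, size_t idx) {
    constexpr uint16_t kMax = (1u << Bits) - 1;   // e.g., Bits = 5 for the PPT
    if (counters[idx] == kMax) {
        for (auto& c : counters) c >>= 1;         // age the whole vector: old records fade
    }
    ++counters[idx];
}

With wider counters (a larger Bits), halving happens less often, so a longer history survives, which matches the trend reported in Table X.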
Monitoring Range. As mentioned in Section IV-C, we reduce the storage overhead of the PPT by monitoring several offsets with a counter. According to Table XI, a monitoring range of 2 can be a good choice for reducing overhead while maintaining high performance. When the monitoring range is 8, the performance is almost the same as using a single OPT, which makes the dual pattern table structure useless.
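A monitoring range of R means that R adjacent offsets share one PPT counter, so the PPT shrinks by a factor of R at the cost of coarser statistics. The index computation below is a minimal sketch under that interpretation; the naming is ours.

#include <cstdint>

// With a monitoring range R (1, 2, 4, or 8 in Table XI), R adjacent cacheline
// offsets map to the same PPT counter, shrinking the PPT by a factor of R.
inline unsigned ppt_counter_index(unsigned cacheline_offset, unsigned range) {
    return cacheline_offset / range;   // e.g., offsets 6 and 7 share index 3 when range = 2
}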

Figure 12: Performance of five prefetchers under different memory bandwidths and LLC sizes.

F. Sensitivity in Single-core

1) Bandwidth: We examine the five prefetchers under various memory bandwidths, as Fig. 12a depicts. PMP performs better than the other prefetchers when the bandwidth is greater than or equal to 1600 MT/s. When the bandwidth reaches 3200 MT/s, the performance is close to its peak. PMP slightly underperforms Bingo, SPP+PPF, and Pythia at the 800 MT/s bandwidth due to its greater bandwidth requirements, but it still outperforms DSPatch, which applies various prefetch policies based on bandwidth, by 18.2%. Because the number of prefetches hardly varies under different memory bandwidths, the performance of PMP is limited when the bandwidth is low.

2) LLC Size: Fig. 12b shows the performance change of the five prefetchers under different LLC sizes. We enlarge the LLC by increasing the number of LLC sets. PMP outperforms the other four prefetchers at different LLC sizes. With the 8MB LLC setting, PMP outperforms Bingo by 3.3%. The performance gap between PMP and Bingo becomes larger when the LLC size increases because the impact of cache pollution caused by useless prefetches is reduced, which is friendly to aggressive prefetching.

Figure 13: Multi-core performance.

G. Multi-core Performance

Fig. 13 shows the multi-core performance of the prefetchers. PMP-Limit is the PMP whose prefetch degree for low-level caches is 1. PMP outperforms DSPatch by 39.6%, and the two heavy prefetchers SPP+PPF and Pythia by 7.3% and
6.9% respectively. PMP can match the performance of Bingo on homogeneous and heterogeneous workloads. Because the aggressive prefetching strategy consumes a larger amount of bandwidth resources in the 4-core system, the performance improvement is slightly reduced compared to PMP in the single-core system. After limiting prefetching, PMP-Limit obtains 1% higher performance compared to Bingo.
VI. RELATED WORK

A. Prefetchers based on Constant Strides

Some memory access patterns can be described using constant strides. A stride is the difference between two consecutive access addresses. This is a straightforward description of memory access patterns, which is widely used by prefetchers. For example, the Next Line Prefetcher (NL) [31] can be seen as a prefetcher that always prefetches data at a one-cacheline stride. Constant-stride-based prefetchers allow different strides for different memory regions [32]. The Best Offset Prefetcher (BOP) [3] periodically calculates the confidence of different strides and chooses the best stride for prefetching. The Sandbox Prefetcher (SP) [19] applies a similar method. The difference between BOP and SP is that the former records real memory accesses for calculating confidence, while the latter uses a bloom filter to record fake prefetches instead. Though prefetchers built on constant strides can improve performance at very low storage costs, they can hardly predict complex patterns. Because constant strides only assume that future accesses arrive at a certain pace, patterns that consist of variable strides, like the address sequence (P + 1, P + 2, P + 4, P + 3, P + 1) in the memory region P, are beyond their capabilities.
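To make the confidence-based stride selection concrete, the sketch below follows the spirit of BOP: each candidate stride earns a score whenever a demand access finds that the same cacheline minus that stride was accessed recently, and the highest-scoring candidate becomes the prefetch stride for the next round. The candidate list, window size, and recent-access test are simplifications of ours and are not the exact BOP parameters.

#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

constexpr int kLineBits = 6;  // 64B cachelines (assumed)

struct StrideScore { int64_t stride; int score; };

// One learning round in the spirit of BOP (simplified): score each candidate
// stride by checking whether (line - stride) appears in the recent-access window.
int64_t pick_best_stride(const std::vector<uint64_t>& demand_addrs,
                         const std::vector<int64_t>& candidates) {
    std::vector<StrideScore> scores;
    for (int64_t s : candidates) scores.push_back({s, 0});

    std::deque<uint64_t> recent;                      // stand-in for BOP's recent-request table
    for (uint64_t addr : demand_addrs) {
        uint64_t line = addr >> kLineBits;
        for (auto& sc : scores)
            if (std::find(recent.begin(), recent.end(), line - sc.stride) != recent.end())
                ++sc.score;                           // this stride would have prefetched in time
        recent.push_back(line);
        if (recent.size() > 256) recent.pop_front();  // bounded history (assumed size)
    }
    return std::max_element(scores.begin(), scores.end(),
                            [](const StrideScore& a, const StrideScore& b) {
                                return a.score < b.score;
                            })->stride;
}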
B. Prefetchers based on Delta Sequences

Delta sequences use several deltas to describe the variation of strides for access sequences; e.g., the delta sequence for the address sequence (P + 1, P + 2, P + 4, P + 3, P + 1) in the memory region P is (1, 2, -1, -2). Different from the other pattern forms that require additional information for indexing, delta sequences of memory accesses can be utilized to index themselves, i.e., the prefixes of delta sequences can be used as a feature. For example, if the prior access history forms the delta sequence (1, 2, -1), we can predict that the next delta is -2 in the example above. Signature Path Prefetcher (SPP) [10] captures patterns using five-delta sequences. The first four deltas of a sequence are compressed as a signature to trigger the last delta. Variable Length Delta Prefetcher (VLDP) [33] uses separate tables to maintain delta sequences with different lengths. Matryoshka [30] supports variable-length sequence matching by coalescing variable-length sequences into a single table.

Delta sequences can record the spatial and temporal information of memory access patterns. An advantage of recording memory access patterns through delta sequences is that the common parts of sequences can be shared, resulting in a reduction of data redundancy. Moreover, no additional features are required for indexing the patterns described through delta sequences. However, prefetchers relying on delta sequences are not accurate enough. To address this obstacle, Perceptron-based Prefetch Filtering (SPP+PPF) [4] designs a heavy perceptron-based filter with nine features to improve the accuracy of SPP. Moreover, prefetchers based on delta sequences have to yield prefetch addresses step by step using a recursive look-ahead strategy [4], [5], [10], [30]. It is difficult for prefetchers relying on delta sequences to issue dozens of prefetches immediately, as Bingo does for deep prefetching.
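The signature-based lookup described above can be sketched as follows, using the running example from this subsection: a window of recent deltas is hashed into a signature, and a table maps that signature to the delta that followed it. The hash and table organization are simplifications of ours; SPP's actual signature update, confidence tracking, and look-ahead path are more involved.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Fold a window of recent deltas into a compact signature (illustrative hash).
uint16_t make_signature(const std::vector<int64_t>& deltas) {
    uint16_t sig = 0;
    for (int64_t d : deltas)
        sig = static_cast<uint16_t>((sig << 3) ^ (static_cast<uint64_t>(d) & 0x3F));
    return sig;
}

int main() {
    // The address sequence P+1, P+2, P+4, P+3, P+1 yields the deltas (1, 2, -1, -2).
    const std::vector<int64_t> history = {1, 2, -1, -2};

    // Train: a prefix of deltas predicts the delta that followed it.
    std::unordered_map<uint16_t, int64_t> next_delta;
    next_delta[make_signature({history[0], history[1], history[2]})] = history[3];

    // Predict: observing (1, 2, -1) again looks up -2; the prefetch target is the
    // last accessed cacheline plus the predicted delta.
    int64_t predicted = next_delta[make_signature({1, 2, -1})];   // -> -2
    return predicted == -2 ? 0 : 1;
}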

C. Other Hardware Data Prefetchers

Some other prefetchers pay attention to irregular memory access patterns, such as Global History Buffer (GHB) [34], Irregular Stream Buffer (ISB) [35], Managed Irregular Stream Buffer (MISB) [6], and Triage [8]. Irregular memory access patterns usually cross many pages, which makes them difficult to describe through pattern forms such as constant strides, delta sequences, or bit vectors. GHB uses a large circular history buffer to record access history. ISB reconstructs physical addresses into structural addresses and stores them in memory. MISB improves ISB by filtering unnecessary memory requests with bloom filters. Triage organizes patterns as key-value pairs of addresses and requires up to half the storage of an LLC to eliminate the requirement of off-chip storage. Most of these prefetchers require too much storage and are only effective in irregular situations, which makes their design unaffordable in general processors.

VII. CONCLUSION

In this paper, we design a low-overhead and powerful prefetcher, named Pattern Merging Prefetcher (PMP). By analyzing many features, we find that the Trigger Offset feature can be used to cluster memory access patterns with high similarity. By merging similar patterns, the storage overhead of our prefetcher is largely reduced, and an accurate prefetch pattern extraction strategy with various effective schemes is applied. Moreover, an optimized prediction mechanism based on multiple features is proposed. PMP outperforms the non-prefetching system by 65.2%, and exceeds the performance of enhanced Bingo by 2.6% with 30x lower storage overhead and Pythia by 8.2% with 6x lower storage overhead in a single-core system.

We believe that there is still room for further exploration of the design idea of pattern merging. In the future, we plan to do further research on the fundamental reasons behind the three observations and try to apply the design idea to other predictors in processors for improving performance.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their feedback.
REFERENCES

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: implications of the obvious," SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20-24, 1995. [Online]. Available: https://doi.org/10.1145/216585.216588

[2] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, "Bingo spatial data prefetcher," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 399-411.

[3] P. Michaud, "Best-offset hardware prefetching," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 469-480.

[4] E. Bhatia, G. Chacon, S. Pugsley, E. Teran, P. V. Gratz, and D. A. Jiménez, "Perceptron-based prefetch filtering," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 1-13.

[5] P. Papaphilippou, P. H. J. Kelly, and W. Luk, "Pangloss: a novel markov chain prefetcher," CoRR, vol. abs/1906.00877, 2019. [Online]. Available: http://arxiv.org/abs/1906.00877

[6] H. Wu, K. Nathella, D. Sunwoo, A. Jain, and C. Lin, "Efficient metadata management for irregular data prefetching," in Proceedings of the 46th International Symposium on Computer Architecture, ISCA 2019, Phoenix, AZ, USA, June 22-26, 2019, S. B. Manne, H. C. Hunter, and E. R. Altman, Eds. ACM, 2019, pp. 449-461. [Online]. Available: https://doi.org/10.1145/3307650.3322225

[7] R. Bera, K. Kanellopoulos, A. Nori, T. Shahroodi, S. Subramoney, and O. Mutlu, Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning. New York, NY, USA: Association for Computing Machinery, 2021, pp. 1121-1137. [Online]. Available: https://doi.org/10.1145/3466752.3480114

[8] H. Wu, K. Nathella, J. Pusdesris, D. Sunwoo, A. Jain, and C. Lin, "Temporal prefetching without the off-chip metadata," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019. ACM, 2019, pp. 996-1008. [Online]. Available: https://doi.org/10.1145/3352460.3358300

[9] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, "Spatial memory streaming," in Proceedings of the 33rd Annual International Symposium on Computer Architecture, ser. ISCA '06. USA: IEEE Computer Society, 2006, pp. 252-263. [Online]. Available: https://doi.org/10.1109/ISCA.2006.38

[10] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti, "Path confidence based lookahead prefetching," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-12.

[11] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, "Accurately and maximally prefetching spatial data access patterns with bingo," 2019. [Online]. Available: https://dpc3.compas.cs.stonybrook.edu/pdfs/Accurately.pdf

[12] S. Volos, J. Picorel, B. Falsafi, and B. Grot, "Bump: Bulk memory access prediction and streaming," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 545-557.

[13] R. Bera, A. V. Nori, O. Mutlu, and S. Subramoney, "Dspatch: Dual spatial pattern prefetcher," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '52. New York, NY, USA: Association for Computing Machinery, 2019, pp. 531-544. [Online]. Available: https://doi.org/10.1145/3352460.3358325

[14] M. Sutherland, A. Kannan, and N. Enright Jerger, "Not quite my temp: Matching prefetches to memory access times," in Data Prefetching Championship Workshop, 2015.

[15] "Spec cpu 2006." [Online]. Available: https://www.spec.org/cpu2006/

[16] "Spec cpu 2017." [Online]. Available: https://www.spec.org/cpu2017/

[17] "Parsec." [Online]. Available: http://parsec.cs.princeton.edu/

[18] J. Shun and G. E. Blelloch, "Ligra: A lightweight graph processing framework for shared memory," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 135-146. [Online]. Available: https://doi.org/10.1145/2442516.2442530

[19] S. H. Pugsley, Z. Chishti, C. Wilkerson, P. Chuang, R. L. Scott, A. Jaleel, S. Lu, K. Chow, and R. Balasubramonian, "Sandbox prefetching: Safe run-time evaluation of aggressive prefetchers," in 20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 15-19, 2014. IEEE Computer Society, 2014, pp. 626-637. [Online]. Available: https://doi.org/10.1109/HPCA.2014.6835971

[20] "Mcf homepage." [Online]. Available: https://www.zib.de/opt-long projects/Software/Mcf/

[21] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "Cacti 6.0: A tool to model large caches," HP laboratories, vol. 27, p. 28, 2009.

[22] "Champsim simulator." [Online]. Available: https://github.com/ChampSim/ChampSim

[23] E. Bhatia, G. Chacon, E. Teran, P. V. Gratz, and D. A. Jiménez, "Enhancing signature path prefetching with perceptron prefetch filtering," 2019. [Online]. Available: https://dpc3.compas.cs.stonybrook.edu/pdfs/Enhancing signature.pdf

[24] "The 2nd data prefetching championship," 2015. [Online]. Available: http://comparch-conf.gatech.edu/dpc2/

[25] "The 3rd data prefetching championship," 2019. [Online]. Available: https://dpc3.compas.cs.stonybrook.edu/

[26] "Cloudsuite traces." [Online]. Available: https://www.dropbox.com/sh/pgmnzfr3hurlutq/AACciuebRwSAOzhJkmj5SEXBa/CRC2 trace?dl=0&subfolder nav tracking=1

[27] "Jedec-ddr4." [Online]. Available: https://www.jedec.org/sites/default/files/docs/JESD79-4.pdf

[28] "Jedec-ddr5." [Online]. Available: https://www.jedec.org/standards-documents/docs/jesd79-5a

[29] "Hbm specification." [Online]. Available: https://www.amd.com/Documents/High-Bandwidth-Memory-HBM.pdf

[30] S. Jiang, Y. Ci, Q. Yang, and M. Li, Matryoshka: A Coalesced Delta Sequence Prefetcher. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3472456.3473510

[31] A. J. Smith, "Sequential program prefetching in memory hierarchies," Computer, vol. 11, no. 12, pp. 7-21, 1978. [Online]. Available: https://doi.org/10.1109/C-M.1978.218016

[32] T. Chen and J. Baer, "Effective hardware based data prefetching for high-performance processors," IEEE Trans. Computers, vol. 44, no. 5, pp. 609-623, 1995. [Online]. Available: https://doi.org/10.1109/12.381947

[33] M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti, "Efficiently prefetching complex address patterns," in Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 5-9, 2015, M. Prvulovic, Ed. ACM, 2015, pp. 141-152. [Online]. Available: https://doi.org/10.1145/2830772.2830793

[34] K. J. Nesbit and J. E. Smith, "Data cache prefetching using a global history buffer," IEEE Micro, vol. 25, no. 1, pp. 90-97, 2005. [Online]. Available: https://doi.org/10.1109/MM.2005.6

[35] A. Jain and C. Lin, "Linearizing irregular memory accesses for improved correlated prefetching," in The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 7-11, 2013, M. K. Farrens and C. Kozyrakis, Eds. ACM, 2013, pp. 247-259. [Online]. Available: https://doi.org/10.1145/2540708.2540730
