Merging Similar Patterns For Hardware Prefetching

Abstract—One critical challenge of designing an efficient prefetcher is to strike a balance between performance and hardware overhead. Some state-of-the-art prefetchers achieve very high performance at the price of a very large storage requirement, which makes them not amenable to hardware implementations in commercial processors.

We argue that merging memory access patterns can be a feasible solution to reducing storage overhead while obtaining high performance, although no existing prefetchers, to the best of our knowledge, have succeeded in doing so because of the difficulty of designing an effective merging strategy. After analyzing a large number of patterns, we find that the address offset of the first access in a certain memory region is a good feature for clustering highly similar patterns. Based on this observation, we propose a novel hardware data prefetcher, named Pattern Merging Prefetcher (PMP), which achieves high performance at a low cost. The storage requirement for storing patterns is largely reduced and, at the same time, the prefetch accuracy is guaranteed by merging similar patterns in the training process. In the prefetching process, a strategy based on access frequencies of prefetch candidates is applied to accurately extract prefetch targets from merged patterns. According to the experimental results on a wide range of workloads, PMP outperforms the enhanced Bingo by 2.6% with 30× less storage overhead and Pythia by 8.2% with 6× less storage overhead.

Keywords-cache; hardware data prefetching;

I. INTRODUCTION

Prefetching has been one of the well-known techniques for speeding up long-latency memory accesses for decades. With the ever-increasing memory consumption of modern applications, a cache hierarchy with limited capacity can be a bottleneck in the performance improvements of processors [1]. Because caches can be effectively utilized to reduce memory access latency for reusable data, a large cache might be used to improve performance further. However, larger caches can lead to longer cache access latency. The performance improvement is limited when enlarging the capacity of high-level caches in the hierarchy (e.g., the L1 Data Cache, L1D) due to their strict latency requirements. Prefetchers are often employed to reduce memory access latency by prefetching data in need without requiring a big on-chip area, leading to better cost performance than enlarging caches.

To accurately capture memory access patterns, some state-of-the-art prefetchers require more and more storage for recording memory access history [2]–[8]. The large storage requirement comes from two aspects. First, unlike a simple memory access pattern that can be described by a constant stride, a complex memory access pattern requires varied strides for description. It can be a considerable cost to record complex patterns through bit vectors [9], delta sequences [10], etc. Second, many features, including Program Counters (PCs), memory addresses, memory address offsets, memory address strides, etc., and their combinations can be leveraged to index patterns for improving prefetch accuracy. A prefetcher aiming for high performance tends to apply fine-grained features with large value ranges to index thousands of patterns, resulting in great storage consumption.

The art of designing an efficient prefetcher that deals with complex memory access patterns is to strike a balance between performance and storage overhead. We find that the major portion of the storage consumption of some prefetchers is caused by severe data redundancy. For instance, we find 82.9% of patterns are redundant in Bingo [2], [11], while astonishingly 24.2% of valid entries are allocated to the same pattern. The low storage efficiency due to this high data redundancy is the main reason for Bingo using a large table with 16,000 entries.

Through analyzing a large number of memory access patterns captured from 125 traces, we observe that the patterns are highly similar if they are indexed by the same address offset of the first access (named Trigger Offset) in a certain memory region. That access is named the Trigger Access of the region. Selecting trigger offsets as features offers great promise for designing an efficient pattern merging strategy, because the information loss is relatively small when merging similar patterns. As a result, a prefetcher can consume less storage and obtain high performance if the patterns with the same trigger offset are merged. Because the patterns indexed by a trigger offset are not necessarily identical, the strategy must be deliberately designed such that characteristics of patterns can be maintained after merging, and prefetch targets can be efficiently extracted from merged patterns.

In this paper, we design a pattern merging strategy that can efficiently quantify characteristics of patterns. Moreover, we design a strategy that can accurately extract prefetch targets from merged patterns based on access frequencies of prefetch candidates. In a nutshell, we propose a novel prefetcher, named Pattern Merging Prefetcher (PMP). Through exhaustive experiments on 125 traces from different benchmarks, PMP outperforms the enhanced Bingo by 2.6% with 30× less storage overhead and Pythia [7] by 8.2% with 6× less storage overhead in a single-core system.

We make the following contributions in this paper:

• We make a crucial observation that memory access patterns are highly similar if they have the same trigger offset, after a detailed analysis of a large number of memory access patterns captured from 125 traces.

• We propose a novel prefetcher, named Pattern Merging Prefetcher (PMP), including: a pattern merging strategy that quantifies characteristics of patterns and reduces the storage consumption; an extraction strategy based on access frequencies of prefetch candidates from merged patterns for accurate prefetching; and an optimization using a dual pattern table structure to provide multi-feature-based pattern prediction.

• We show through extensive experiments that PMP obtains better performance at a low storage cost compared to four state-of-the-art prefetchers over 125 traces.

The remaining sections are organized as follows. Section II provides the background. Section III presents our motivations, i.e., three key observations of memory access patterns. Section IV describes the innovative mechanisms of PMP. An exhaustive performance evaluation is presented in Section V. Section VI discusses related work about hardware data prefetching with different pattern forms. Finally, the paper is concluded in Section VII.

II. BACKGROUND

We first briefly introduce some typical prefetchers based on the bit vector pattern form. Next, we delve into the bit vector pattern capturing framework of Spatial Memory Streaming (SMS) [9], on which our prefetcher is based.

A. Prefetchers based on Bit Vectors

A bit vector describes the accessed positions in a memory region, where each bit corresponds to an offset in the region. For example, the bit vector 1011 means that P, P+2, and P+3 have been accessed in memory region P. The bit vector form was first leveraged in SMS, whose efficiency has been proven for addressing complex memory access patterns. Bulk Memory Access Prediction and Streaming (BuMP) [12] improves SMS by reducing memory energy consumption. Bingo [2] improves SMS by using multiple features, combining PCs with offsets or addresses, to accurately locate patterns to be prefetched. Dual Spatial Pattern Prefetcher (DSPatch) [13] records memory access patterns with dual bit vectors, generated by an OR operation and an AND operation respectively, and uses different prefetch policies according to the bandwidth of the environment.

The bit vector form has many advantages. First, any memory access distribution in a memory region can be represented with a bit vector. Second, bit vectors can represent dozens of offsets in a memory region at a very low cost. Benefiting from the high information density of bit vectors, a prefetcher can immediately generate many prefetch targets, leading to deep prefetching. Though bit vectors do not maintain the temporal information of memory accesses, i.e., the access order, a prefetcher can apply some heuristic methods to compensate for this drawback. For example, a prefetcher can first prefetch the target nearest to the currently accessed address. For a prefetcher based on bit vectors, the performance improvement gained by attaching temporal information is not significant [14].

B. Pattern Capturing Framework of SMS

SMS adopts a lightweight pattern capturing framework, which accounts for 2% of its total storage. It consists of two set-associative tables: the Filter Table (FT) and the Accumulation Table (AT). The FT records information of the first access to each memory region. The AT accumulates memory access patterns for each memory region. The pattern capturing procedure of SMS is shown in Fig. 1. First, when the memory region of a new memory access misses in both the AT and the FT, the FT allocates a new entry to store the PC and the address of the access. The offset of its address is a trigger offset. Second, when another access to the same region comes with a different offset, a bit vector is assembled with the offsets of these two accesses and sent to the AT.
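For concreteness, this capture flow can be sketched in C as follows. This is a minimal sketch, not the actual SMS or PMP hardware: the direct-mapped toy tables, the table sizes, and the absence of a replacement policy are simplifying assumptions.

#include <stdint.h>
#include <stdbool.h>

#define FT_SIZE 8
#define AT_SIZE 32

typedef struct {            /* Filter Table entry: first access to a region */
    bool     valid;
    uint64_t region;        /* region tag */
    uint64_t pc;            /* PC of the trigger access */
    uint8_t  trigger_offset;
} ft_entry_t;

typedef struct {            /* Accumulation Table entry: accumulated pattern */
    bool     valid;
    uint64_t region;
    uint8_t  trigger_offset;
    uint64_t bit_vector;    /* bit i set => cacheline offset i was accessed */
} at_entry_t;

static ft_entry_t ft[FT_SIZE];   /* direct-mapped for simplicity */
static at_entry_t at[AT_SIZE];

void on_l1d_load(uint64_t addr, uint64_t pc) {
    uint64_t region = addr >> 12;           /* 4KB region */
    uint8_t  offset = (addr >> 6) & 63;     /* cacheline offset inside it */
    at_entry_t *a = &at[region % AT_SIZE];
    ft_entry_t *f = &ft[region % FT_SIZE];

    if (a->valid && a->region == region) {  /* keep accumulating the pattern */
        a->bit_vector |= 1ULL << offset;
        return;
    }
    if (!(f->valid && f->region == region)) {  /* first access: record trigger */
        f->valid = true;
        f->region = region;
        f->pc = pc;
        f->trigger_offset = offset;
        return;
    }
    if (offset != f->trigger_offset) {      /* second distinct offset: move to AT */
        a->valid = true;
        a->region = region;
        a->trigger_offset = f->trigger_offset;
        a->bit_vector = (1ULL << f->trigger_offset) | (1ULL << offset);
        f->valid = false;
    }
}

A trained AT entry yields the region's bit vector together with its trigger offset and PC, which is exactly the input consumed by the merging strategy described in Section IV-A.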
Figure 3: Pattern collision and pattern duplicate.
III. MOTIVATION

To analyze the characteristics of memory access patterns, we use the framework of SMS to capture patterns in 125 traces from SPEC CPU 2006 [15], SPEC CPU 2017 [16], PARSEC [17], and Ligra [18]. The FT is 4 × 16 set-associative, the AT is 8 × 16 set-associative, and the pattern length is 64, matching the cachelines in 4KB pages.

Observation 1: Only a tiny minority of memory access patterns occur with high frequency.

Because a bit vector consists of dozens of bits (64b), the huge state space (2^64) might introduce a larger amount of data than any cache can afford. Fortunately, the majority of patterns occur infrequently. According to our experiments, 6.5×10^6 distinct patterns occur about 1.1×10^8 times in total among the 125 traces, and 75.6% of distinct patterns appear only once. As shown in Fig. 2, the top 10 frequent patterns account for 33.1% of the total occurrences. Note that these patterns cover an extremely small portion (1.55×10^-4 %) of the distinct patterns. Moreover, the top 100 and the top 1000 frequent patterns account for 57.4% and 73.8% of the total occurrences respectively. Because only a tiny minority of patterns occur intensively, it is feasible to design a lightweight prefetcher with dozens of entries, in which only highly frequent patterns are maintained.

Observation 2: Indexing memory access patterns with fine-grained features can lead to severe data redundancy.

The indexing schemes used by state-of-the-art prefetchers [2]–[4], [9], [19] can hardly guarantee that the indexed memory access patterns are unique in their storage. As a result, a large amount of data redundancy might be introduced. To quantitatively analyze the data redundancy of various indexing schemes using different features, we define the Pattern Collision Rate (PCR) as the number of distinct patterns related to a feature value, and the Pattern Duplicate Rate (PDR) as the number of feature values related to a pattern. Fig. 3 shows examples of pattern collisions and pattern duplicates. We say the pattern 1101 has two duplicates because the feature values A and B both index it, and the patterns 1101 and 0101 collide because they are both indexed by the same feature value B. The PDR of 1101 is 2 and the PCR related to B is 2. A greater PDR indicates higher data redundancy, since the same pattern related to different feature values must be stored in multiple entries in a classical set-associative cache. The averages of the two metrics corresponding to various features, measured over the 125 traces, are shown in Table I.

Intuitively, prefetchers tend to use fine-grained features with high bit widths, so that reduced PCRs can be obtained to effectively differentiate patterns. According to Table I, the 80-bit PC+Address feature has a PCR of 1.7, providing the highest resolution in recognizing patterns compared to the other listed features. However, large storage could be wasted on storing redundant patterns, because the PDR of the PC+Address feature is high (608.7). When evaluating Bingo, which uses the PC+Address feature, 82.9% of its patterns are redundant at the end of simulation, and 24.2% of valid entries are occupied by the same pattern.

For features containing addresses, their high PDRs indicate that many patterns are shared among different memory regions. To decrease the duplicate patterns caused by this, the high bits of addresses that represent regions cannot be used as a feature, so that the same patterns of different regions can be indexed into the same entry. Because it is difficult to find a feature that can eliminate the duplication, we try to reduce duplication with features of low PDRs. A new problem appears if we attempt to leverage features
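The two metrics are straightforward to measure offline. The following toy C sketch (the linear scans, array bounds, and the numeric encodings of A, B, 1101, and 0101 are illustrative only) reproduces the Fig. 3 example:

#include <stdint.h>
#include <stdio.h>

#define MAX 1024
static uint64_t feat[MAX], pat[MAX];   /* distinct (feature value, pattern) pairs */
static int n_pairs = 0;

void record(uint64_t f, uint64_t p) {  /* deduplicate pairs */
    for (int i = 0; i < n_pairs; i++)
        if (feat[i] == f && pat[i] == p) return;
    feat[n_pairs] = f; pat[n_pairs] = p; n_pairs++;
}

int pcr(uint64_t f) {                  /* PCR: distinct patterns per feature value */
    int c = 0;
    for (int i = 0; i < n_pairs; i++) c += (feat[i] == f);
    return c;
}

int pdr(uint64_t p) {                  /* PDR: distinct feature values per pattern */
    int c = 0;
    for (int i = 0; i < n_pairs; i++) c += (pat[i] == p);
    return c;
}

int main(void) {
    /* The Fig. 3 example: A -> 1101, B -> 1101, B -> 0101. */
    record(0xA, 0xD); record(0xB, 0xD); record(0xB, 0x5);
    printf("PDR(1101) = %d\n", pdr(0xD));  /* prints 2 */
    printf("PCR(B)    = %d\n", pcr(0xB));  /* prints 2 */
    return 0;
}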
Figure 5: Trigger Offset-indexed pattern heat maps: (a) for an MCF trace; (b) for an Astar trace.
have the highest similarity. Trigger Offset can be a good feature for implementing an efficient pattern merging method.

To show the clustered patterns straightforwardly, we draw the patterns of several typical traces as heat maps. A point in a heat map represents the magnitude of occurrences of the patterns that contain the corresponding offset. Fig. 5a is a heat map that shows the representative memory access distributions of an MCF trace. For most memory accesses in the trace, the program tends to access a few positions around the current access addresses. These frequently visited positions form a blue dotted slash in Fig. 5a. When certain big trigger offsets appear, the program tends to issue backward memory accesses, which form three horizontal dotted lines at the bottom. Fig. 5b shows another memory access distribution, from an Astar trace, in which the patterns can be described through three slashes. This indicates that the memory accesses of Astar obey a constant stride pattern.

However, the representative patterns cannot be observed when the hashed PC+Address feature is used. Fig. 5c shows the memory access distributions of the MCF trace in this situation. As the figure illustrates, the patterns are scattered into all 64 sets. We cannot tell the common memory access distributions in the heat map. The common characteristics of memory access patterns are destroyed, probably resulting in inaccurate merging results.

Moreover, because the PDR of the PC feature is the smallest in Table I, we would have thought that it could be a better choice compared to the other features. Surprisingly, we find the similarity of patterns indexed by PCs is lower than that of the Trigger Offset feature according to Fig. 4. For most traces, patterns clustered by the PC feature present overlapped memory access distributions in each set, leading to limited recognition of patterns. In addition, PCs generally distribute patterns into several concentrated sets, which can be leveraged to predict whether or not patterns can be prefetched. As shown in Fig. 5d, the PC-indexed patterns are allocated into several sets, forming horizontal lines. Consequently, the PC feature is used to help predict prefetch levels, as described later in Section IV-C.

Discussion. We think one major reason for Observation 3 is that Trigger Offset is a more general feature relating to the memory access behaviors of programs compared to PCs and addresses. For example, the backward memory accesses shown at the bottom of Fig. 5a can be generated by the two loops of the following code in MCF [20]:

// File: pflowup.c
// Method: MCF_primal_update_flow
// data are stored in a big array
for ( ; iplus != w; iplus = iplus->pred )
{ ... }
for ( ; jplus != w; jplus = jplus->pred )
{ ... }

Features containing addresses fail to cluster the same patterns because the accesses to a big array can cross many memory regions. The patterns can also hardly be clustered by the PC feature due to the different PCs of the two loops. In contrast, the trigger offsets can be a general and recognizable feature. First, the backward accesses to a big array will first load data from the end of a region, probably resulting in the same big trigger offset for the same pattern in different regions. Second, the three major patterns generated by the loops can be differentiated by different big trigger offsets, as Fig. 5a shows. They could be mixed if the PCs were used as a feature.

IV. DESIGN: PMP

The prefetching mechanism of PMP consists of two processes working in parallel: the training process and the prefetching process. In this section, we first introduce a Pattern Merging strategy in the training process, enabling a substantial reduction of storage requirements without sacrificing performance. Second, we present a Prefetch Pattern Extraction strategy that generates highly accurate prefetch targets from merged patterns in the prefetching process. Third, we propose an optimized structure, Dual Pattern Tables, to leverage multiple features for improving prefetch accuracy further. Arbitration rules are also proposed to decide the final prefetch targets based on the predictions of the dual pattern tables. Then, we put it all together and summarize the main flows. Finally, we discuss the overhead.

A. Pattern Merging

The observations in Section III inspire us to build a storage-efficient prefetcher by merging memory access patterns clustered by trigger offsets. How to merge these patterns is a key problem that directly determines the effectiveness of our prefetcher. A bit-wise OR operation or a bit-wise AND operation could be an option for pattern merging, but we eventually abandoned both operations. The OR operation creates a superset of patterns, e.g., the union of pattern 1111 and any other pattern is always 1111. The AND operation generates a common subset of patterns, e.g., the intersection of pattern 0000 and any other pattern is always 0000. As demonstrated in these examples, a few outlier samples can completely obscure the differences in memory access patterns, leading to inaccurate records.

Based on extensive studies, we choose to merge bit vector patterns by counting the number of occurrences of each offset in the bit vectors, instead of using the two operations mentioned above. To achieve this goal, we apply a vector of counters (named Counter Vector), in which each element records the number of accesses to an offset, to merge the patterns in a cluster. In the training process of PMP, the counters for offsets in a vector increase in parallel if the corresponding offsets are accessed in a bit vector pattern to be merged.
Figure 6: Architecture of PMP.

Fig. 6a illustrates an example of merging the bit vector (0, 1, 1, 0, 1, 0, 0, 0), captured from the access sequence P+2, P+1, P+4 in memory region P, into the counter vector (3, 0, 3, 0, 3, 0, 0, 0). First, the bit vector needs shifting into an anchored bit vector. The bit vector is converted into (1, 0, 1, 0, 0, 0, 0, 1) by a left circular shift of 2 positions, because the trigger offset is 2. Then, the first, the third, and the last elements of the counter vector each increase by 1. Finally, the counter vector is updated to (4, 0, 4, 0, 3, 0, 0, 1). Note that the counter corresponding to the trigger offset will always be the first element of a shifted vector and increases in every merging operation, so it is called the Time Counter.

Because all the patterns are merged, old records cannot simply be evicted during the training process. Instead, all the elements in a counter vector are halved when the time counter saturates. This mechanism aims at reducing, but keeping, the effects of old records in the prefetching process. Supposing that the maximum value of time counters is 3, the counter vector (4, 0, 4, 0, 3, 0, 0, 1) in the example above is saturated, so it is halved to (2, 0, 2, 0, 1, 0, 0, 0).

The pattern merging strategy is efficient for two reasons. First, because the patterns to be merged have high similarity, their common characteristics remain after merging, which lays the groundwork for high prediction accuracy in the prefetching process. Otherwise, if the patterns being merged were irrelevant, the merged patterns would inevitably be ambiguous for prefetching. Second, unlike the OR/AND operation, which accepts/rejects all the differences in patterns, our strategy quantifies the characteristics of patterns while reducing the storage consumption. The statistical results after merging can be leveraged for precisely predicting prefetch targets, which is described in the next subsection.
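The merge and halving steps can be summarized as follows. This sketch assumes the 8-offset vectors of the running example and a time-counter saturation value of 3; the real design uses 64-element vectors of 5-bit counters (Table II).

#include <stdint.h>

#define LEN 8        /* 8 offsets, as in the Fig. 6a running example */
#define TIME_MAX 3   /* assumed saturation value of the time counter */

/* Merge one captured bit-vector pattern into a counter vector.
 * bits[i] = 1 iff offset i of the region was accessed. */
void merge_pattern(uint8_t counters[LEN], const uint8_t bits[LEN], int trigger) {
    /* Anchor: left circular shift so the trigger offset becomes element 0.
     * counters[0] is therefore the time counter: it increases on every merge. */
    for (int i = 0; i < LEN; i++)
        if (bits[(i + trigger) % LEN])
            counters[i]++;

    /* Halve everything once the time counter exceeds its saturation value,
     * so old history decays but is not discarded outright. */
    if (counters[0] > TIME_MAX)
        for (int i = 0; i < LEN; i++)
            counters[i] /= 2;
}

Replaying the text's example, merging the bits (0, 1, 1, 0, 1, 0, 0, 0) with trigger offset 2 into (3, 0, 3, 0, 3, 0, 0, 0) first yields (4, 0, 4, 0, 3, 0, 0, 1), which the halving step then turns into (2, 0, 2, 0, 1, 0, 0, 0).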
B. Prefetch Pattern Extraction

In the prefetching process, stored patterns can be triggered when a trigger access comes. Triggered patterns in the form of counter vectors cannot simply be replayed for prefetching like prior prefetchers do. Therefore, a conversion from triggered patterns to prefetch targets is required. Our strategy is to individually select target offsets by examining the corresponding elements in the triggered counter vector to form a Prefetch Pattern, which is a vector of target cache levels, one for each offset. We consider three different schemes.

Access-Number-based Extraction. A prefetch target for an offset is generated if the corresponding element of the triggered counter vector is equal to or greater than a prefetch threshold. We call this scheme Access-Number-based Extraction (ANE). For example, the counter vector (4, 2, 0, 1) can be converted to the prefetch pattern (0, L1, 0, L1) if the prefetch threshold for L1D is 1; then A+1 and A+3 are the prefetch targets for the current cacheline address A. Please note that the trigger offset (the first offset after shifting) itself will never be prefetched. This scheme is easy to implement, but its obvious drawback is that any target offset needs an inevitable cold-start time to reach the prefetch threshold. For a threshold T, an offset will not be prefetched until it has been visited T times, losing many prefetch chances.

Access-Ratio-based Extraction. For every element of the triggered counter vector, its ratio to the sum of all counters can be compared to a threshold for generating prefetch targets. We briefly call this scheme Access-Ratio-based Extraction (ARE). For example, given a prefetch threshold of 1/4 for L1D, the ratios of the elements in the counter vector (4, 2, 0, 1) are (?, 2/3, 0, 1/3). Because the trigger offset is excluded, its ratio is ignored. The counter vector can then be converted to the prefetch pattern (0, L1, 0, L1), and A+1 and A+3 are the prefetch targets for the current cacheline address A. Though a counter vector is halved when the time counter saturates, the ratios can hardly be changed: the counters and their sum are both halved, so the ratios do not change theoretically, although the calculation results may be slightly affected by precision limitations in hardware. The ARE thus avoids retraining after halving if the pattern of the following memory accesses does not vary, so the prefetching process continues. However, the ARE implicitly limits the maximum number of prefetches in one prediction, namely the prefetch depth.
Because a prefetch target on an offset is generated only if its corresponding ratio is equal to or greater than the threshold, at most d offsets can be prefetched for a prefetch threshold of 1/d. For a triggered vector containing 64 counters with the same value, e.g., a stream pattern, no addresses can be prefetched unless the threshold is lower than 1/63 (excluding the trigger offset). But a low threshold smaller than 1/63 may lead to inaccurate prefetching in most cases. The ARE introduces an unnecessary trade-off between the prefetch depth and the prefetch accuracy.

Access-Frequency-based Extraction. Access frequencies of offsets can be a good criterion for prefetch target selection. An access frequency differs from an access ratio in that the former indicates how many times an offset occurs in a period, and it is not influenced by the occurrences of other offsets. Moreover, the access frequency of an offset can be easily calculated by dividing its counter by the time counter. The Access-Frequency-based Extraction (AFE) generates prefetch targets by comparing the access frequencies of offsets to a prefetch threshold. For example, given an L1D prefetch threshold of 1/4, the frequencies of the counters in the counter vector (4, 2, 0, 1) are (?, 2/4, 0, 1/4), so the counter vector can be converted to the prefetch pattern (0, L1, 0, L1); then A+1 and A+3 are the prefetch targets for the current cacheline address A. This scheme has several advantages. First, because the halving mechanism hardly affects the frequencies of offsets, the AFE avoids the extra retraining process when the following memory access patterns do not change. More importantly, the AFE has better adaptability to different kinds of patterns compared to the former two schemes. On the one hand, the AFE does not have the cold-start problem. If an offset appears every time in the last T patterns, its frequency is 100% from the beginning of training, which exceeds any threshold. This is friendly to patterns with few repetitions. On the other hand, no implicit restrictions are introduced by the AFE, since every offset that frequently occurs can be independently selected. This is friendly to stream-like patterns. Finally, we use the AFE as the default prefetch pattern extraction scheme.
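The three schemes can be contrasted in a few lines of C. The sketch below is illustrative (the enum encoding and the floating-point comparisons stand in for what would be fixed-point hardware logic); it follows the running example with the counter vector (4, 2, 0, 1), whose element 0 is the time counter, and uses the four fill states and the two-level thresholds described in this section.

#include <stdbool.h>
#include <stdint.h>

#define LEN 4   /* running example: counter vector (4, 2, 0, 1), trigger at 0 */

typedef enum { NOPF = 0, TO_L1, TO_L2, TO_LLC } pf_level_t;

/* ANE: prefetch offset i if its raw count reaches a threshold. */
bool ane(const uint8_t c[LEN], int i, uint8_t threshold) {
    return i != 0 && c[i] >= threshold;    /* trigger offset never prefetched */
}

/* ARE: compare each counter to the sum of the non-trigger counters
 * (the trigger counter is excluded, matching the worked example). */
bool are(const uint8_t c[LEN], int i, double threshold) {
    int sum = 0;
    for (int j = 1; j < LEN; j++) sum += c[j];
    return i != 0 && sum > 0 && (double)c[i] / sum >= threshold;
}

/* AFE (the default): frequency = counter / time counter; two thresholds
 * pick the fill level, as with Tl1d = 50% and Tl2c = 15% in the paper. */
pf_level_t afe(const uint8_t c[LEN], int i, double t_l1d, double t_l2c) {
    if (i == 0 || c[0] == 0) return NOPF;
    double freq = (double)c[i] / c[0];     /* c[0] is the time counter */
    if (freq >= t_l1d) return TO_L1;
    if (freq >= t_l2c) return TO_L2;
    return NOPF;
}

For the counter vector (4, 0, 4, 0, 3, 0, 0, 1) of Fig. 6b, the frequencies 4/4, 3/4, and 1/4 combined with Tl1d = 50% and Tl2c = 15% yield exactly the prefetch pattern (0, 0, L1, 0, L1, 0, 0, L2) given in the text.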
To reduce the cache pollution in high-level caches without losing prefetch chances, PMP prefetches data into different cache levels depending on various thresholds. In the default configuration using the AFE, the confidence refers to frequencies. The threshold Tl1d for prefetching data into L1D is 50% and the threshold Tl2c for the L2 Cache (L2C) is 15%. The target addresses, assembled using the current cacheline address and the offsets with confidence greater than or equal to Tl1d, are prefetched to L1D. The targets with confidence greater than or equal to Tl2c but less than Tl1d are prefetched into L2C to reduce the risk of cache pollution in L1D. Fig. 6b shows an example where the prefetch pattern (0, 0, L1, 0, L1, 0, 0, L2) is extracted from the counter vector (4, 0, 4, 0, 3, 0, 0, 1).

As shown at the bottom of Fig. 6c, a new prefetch pattern is stored in the Prefetch Buffer (PB) and indexed by the region address of the trigger access. There is no fixed prefetch degree for PMP. When free Prefetch Queue (PQ) entries exist, PMP first assembles addresses using the valid offsets in the prefetch pattern that are near the cacheline address of the trigger access and issues them to the corresponding cache levels. If the PQ is full, the prefetching process is suspended. When any load with an address in the same region reappears and free PQ entries exist, the process continues with the prefetch pattern in the PB. Please note that prefetch requests are prohibited from occupying all MSHRs; at least one MSHR is reserved for normal load/store requests.

C. Multi-Feature-based Prediction

Based on Observation 3, we notice that the PC feature of the trigger access can also be leveraged to help predict prefetch targets. The combination of PCs and trigger offsets could be used as a feature like prior research does, but it may not be the best option. First, the concatenated feature expands the index range greatly, necessitating a large table to store merged patterns. Second, patterns clustered by the combination of PCs and trigger offsets have lower similarity. The memory access patterns indexed by trigger offsets are separated into different sets again by PCs, bringing a greater divergence of patterns in a set than using the Trigger Offset feature or the PC feature alone.

Dual Pattern Tables. To enable multi-feature-based prediction and reduce the divergence of memory access patterns in a cluster, dual pattern tables are applied to maintain merged patterns indexed by trigger offsets and PCs respectively, as Fig. 6c illustrates. Because counter vectors collect all the patterns indexed by the same feature value without evicting any, the vectors can be stored in tagless direct-mapped tables. The Offset Pattern Table (OPT), indexed with trigger offsets, is the primary pattern table, and the PC Pattern Table (PPT), indexed with PCs, is the supplementary pattern table. During the training process, the dual pattern tables are updated simultaneously. In the prefetching process, candidate prefetch patterns are independently predicted by the two tables when a trigger access comes. Two candidate prefetch patterns are given by the tables using the AFE described in the last subsection.

Arbitration. An arbiter is applied to decide the final prefetch pattern depending on the predictions from the two tables. Though both pattern tables can give prefetch targets individually, it is better to discard the targets given by the PPT that are not included in the targets given by the OPT. The predictions of the OPT are more accurate than those of the PPT according to the experimental results in Section V-E3. As shown in Fig. 6e, the arbitration rules are as follows: 1) prefetches aiming at the same target offset are issued to L1D only if both pattern tables predict to prefetch data into L1D; 2) if the same target offset is predicted by both tables but one table predicts to prefetch data into L2C, the target will be prefetched into L2C; 3) if the PPT has no predictions, the cache level of prefetches predicted by the OPT will be downgraded (e.g., L2C to the Last Level Cache, LLC); 4) no prefetches will be issued if there are no predictions from the OPT. Please note that the final prefetch pattern after arbitration is stored in the PB.
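The four rules reduce to a small per-offset decision function. The sketch below reuses the pf_level_t encoding from the extraction sketch and assumes a one-level downgrade for rule 3, following the L2C-to-LLC example in the text.

typedef enum { NOPF = 0, TO_L1, TO_L2, TO_LLC } pf_level_t;  /* as above */

/* Decide the final fill level for one target offset, given the OPT's
 * and the PPT's candidate predictions for that offset. */
pf_level_t arbitrate(pf_level_t opt, pf_level_t ppt) {
    if (opt == NOPF) return NOPF;               /* rule 4: OPT silent, no prefetch */
    if (ppt == NOPF)                            /* rule 3: PPT silent, downgrade */
        return (opt == TO_L1) ? TO_L2 : TO_LLC;
    if (opt == TO_L1 && ppt == TO_L1)           /* rule 1: both vote L1D */
        return TO_L1;
    return TO_L2;                               /* rule 2: any L2C vote wins */
}

Applied per offset to the Fig. 6 example, with (0, 0, L1, 0, L1, 0, 0, L2) from the OPT and the coarse (0, L1, 0, L2) from the PPT, this reproduces the final pattern (0, 0, L1, 0, L2, 0, 0, L2).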
Figure 7: Training and prefetching flows of PMP.

Table II: Preset Parameters

  Parameter                       Value
  OPT Counter Size                5b
  PPT Counter Size                5b
  OPT Pattern Length              64
  PPT Pattern Length              32
  Region Size                     4KB
  Monitoring Range                2
  L1D Prefetch Threshold (Tl1d)   50%
  L2C Prefetch Threshold (Tl2c)   15%

Table III: Detailed Storage Overhead

  Structure             Entry    Field                                     Storage
  Filter Table          8 × 8    Region Tag (33b), Hashed PC (5b),         376 Bytes
                                 Trigger Offset (6b), LRU (3b)
  Accumulation Table    2 × 16   Region Tag (35b), Hashed PC (5b),         456 Bytes
                                 Trigger Offset (6b), Bit Vector (64b),
                                 LRU (4b)
  Offset Pattern Table  64 × 1   Counter Vector (320b)                     2560 Bytes
  PC Pattern Table      32 × 1   Coarse Counter Vector (160b)              640 Bytes
  Prefetch Buffer       1 × 16   Region Tag (36b), Prefetch Pattern        332 Bytes
                                 (126b), LRU (4b)
  Total                                                                    ≈ 4.3KB
Coarse Counter Vector. Because the predictions from the PPT only affect prefetch cache levels, the storage costs can be reduced further by monitoring several offsets with one counter. The number of offsets monitored by a counter is called the Monitoring Range. As Fig. 6d depicts, the union of every few adjacent bits of a bit vector is counted by a shorter counter vector, named the Coarse Counter Vector. The 8-bit vector 10100001 is reduced to 1101 by joining every two bits, and 1101 is then merged into the coarse counter vector (3, 1, 0, 1). Every element in a coarse counter vector controls whether and where to prefetch on the adjacent offsets it covers at the prefetch level arbitration step. The final prefetch pattern in the example of Fig. 6 is (0, 0, L1, 0, L2, 0, 0, L2), based on the two candidate prefetch patterns: (0, 0, L1, 0, L1, 0, 0, L2) from the OPT and (0, L1, 0, L2) from the PPT.
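The reduction into a coarse counter vector amounts to an OR over each group of adjacent bits followed by a normal counter merge, as the following sketch shows (the pattern length and monitoring range follow the Fig. 6d example; the anchoring by the trigger offset is omitted for brevity):

#include <stdint.h>

#define LEN 8      /* full pattern length in the Fig. 6d example */
#define RANGE 2    /* monitoring range: offsets covered per coarse counter */

/* Reduce a bit vector by OR-ing each group of RANGE adjacent bits,
 * then merge the reduced vector into the coarse counter vector.
 * Example: 10100001 reduces to 1101. */
void merge_coarse(uint8_t coarse[LEN / RANGE], const uint8_t bits[LEN]) {
    for (int g = 0; g < LEN / RANGE; g++) {
        uint8_t any = 0;
        for (int k = 0; k < RANGE; k++)
            any |= bits[g * RANGE + k];
        coarse[g] += any;   /* one counter per group of adjacent offsets */
    }
}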
D. Putting It Together

Fig. 7 shows the training and the prefetching flows of PMP. The training process operates on L1D loads. If the region of an L1D load misses in both the AT and the FT, the load is a trigger access in that region. The pattern capturing framework records the following accesses in the region as described in Section II-B, and the captured pattern is merged into the OPT and the PPT. The prefetching process starts at the same time, when the trigger access comes. The offset and the PC of the trigger access are used to trigger (index) the OPT and the PPT respectively. The final prefetch pattern is obtained after the extraction and the arbitration, and it is then stored in the PB for prefetching.

E. Overhead

Table II lists all the preset parameters of PMP. Table III lists the overhead details. In our default configuration, the OPT indexed with trigger offsets contains 64 entries, and the PPT indexed with PCs has 32 entries. The FT and the AT have 64 and 32 entries respectively. The PB can store 16 prefetch patterns. Two bits are enough for the four states of every offset: No Prefetch, Prefetch to L1D, Prefetch to L2C, and Prefetch to LLC. Therefore, a prefetch pattern requires 126 bits for 63 targets. The size of the memory region that each pattern corresponds to is 4KB. Finally, the total hardware overhead of PMP is 4.3KB (30× lower than Bingo).

We use CACTI [21] with its 22nm configuration to estimate the area consumption and the access time of our design. The area of the dual pattern table structure is 0.0069 mm^2. It is 151× smaller than the large set-associative pattern table in Bingo, which costs 1.0372 mm^2. The total access time (input and output) of the dual pattern table structure is 0.1ns, which is 11× shorter than the access time of Bingo's pattern table.

V. EVALUATION

A. Methodology

1) Configuration: We use the ChampSim [22] simulator to evaluate the prefetchers. Table IV illustrates the system configuration of ChampSim. We compare PMP to four state-of-the-art prefetchers: DSPatch [13], Bingo [2], [11], SPP+PPF [4], [23], and Pythia [7]. For a fair comparison, all prefetchers are placed at L1D, and no helper prefetchers exist in the other cache levels. That is, the five prefetchers are all single-level in our evaluation. DSPatch is a lightweight bit-vector-based prefetcher that records patterns through an OR and an AND operation. The competition version of Bingo in the third Data Prefetching Championship (DPC) [11] is the most powerful single-level prefetcher according to our experiments with a couple of state-of-the-art prefetchers.
Table IV: Simulated System Configuration

  Core   One to four cores, 4GHz, 4-wide, 352-entry ROB, 128-entry LQ, 72-entry SQ, 4KB pages
  TLBs   64-entry ITLB, 64-entry DTLB, 1536-entry L2DTLB
  L1I    32KB, 8-way, 32-entry PQ, 8-entry MSHR, 4 cycles
  L1D    48KB, 12-way, 8-entry PQ, 16-entry MSHR, 5 cycles
  L2C    512KB, 8-way, 16-entry PQ, 32-entry MSHR, 10 cycles
  LLC    2MB to 8MB, 16-way, 32 to 128-entry PQ, 64 to 256-entry MSHR, 20 cycles, Inclusive
  DRAM   4GB 1 channel (1-core), 8GB 2 channels (4-core), 3200 MT/sec

Table VI: Traces

  SPEC 06   SPEC 17   Ligra   PARSEC   Total
  38        36        42      9        125

Table VII: Heterogeneous 4-core Workloads

  MIX                              #
  All Low MPKI                     10
  All Medium MPKI                  10
  All High MPKI                    10
  Half Low and Half Medium MPKI    10
  Half Low and Half High MPKI      10
  Half Medium and Half High MPKI   10
Table V: Prefetcher Overhead

  DSPatch   Bingo     SPP+PPF   Pythia   PMP
  3.6KB     127.8KB   48.4KB    25.5KB   4.3KB

We enhance it by doubling the size of its pattern table relative to its original configuration [2]. SPP+PPF is a strong competitor in DPC-3 [23], leveraging nine different features. Pythia is a new type of prefetcher built on machine learning in hardware. Table V shows the storage overhead of these prefetchers.

2) Benchmarks: Over one hundred traces, captured from SPEC CPU 2006 [15], SPEC CPU 2017 [16], PARSEC [17], and Ligra [18], are used to evaluate the single-core performance of the five prefetchers. For SPEC CPU 2006 and SPEC CPU 2017, we use the instruction traces provided by DPC-2 and DPC-3 [24], [25]. For PARSEC and Ligra, we use the traces provided by Pythia [7]. All the traces have more than five LLC misses per kilo instructions (MPKI). Table VI lists the number of traces from each benchmark.

We evaluate the multi-core performance of the five prefetchers with homogeneous and heterogeneous multi-programmed workloads on a 4-core processor. For homogeneous workloads, we examine the prefetchers with the 125 traces, each of which is simultaneously performed by the different cores. For heterogeneous workloads, we first classify the traces into three classes: Low MPKI (5 < MPKI ≤ 10), Medium MPKI (10 < MPKI ≤ 20), and High MPKI (MPKI > 20). Then, we randomize traces according to their classes and generate workloads as Table VII shows. We also examined the prefetchers with CloudSuite traces [26] but did not see many performance improvements, so the results are omitted.

In both single-core and multi-core systems, we use the first 50 million instructions of a trace to warm up the micro-architectural structures and the next 200 million instructions to examine the performance of each core.

B. Single-core Performance

Fig. 8 illustrates the normalized IPCs (NIPCs) of the five prefetchers in the single-core system. We observe that PMP outperforms the other four prefetchers at a very low storage cost. PMP improves the performance of the non-prefetching baseline by 65.2% and outperforms DSPatch, Bingo, SPP+PPF and Pythia by 41.3% (up to 177.4%), 2.6% (up to 62.4%), 6.5% (up to 59.2%) and 8.2% (up to 183.1%) respectively. DSPatch requires the lowest storage among the five prefetchers. However, the low performance of DSPatch indicates that its pattern capturing strategies, an OR and an AND operation, are inefficient. Bingo is a heavy prefetcher that is 3× larger than a typical L1D, so it is more realistic for it to be placed at the lower-level caches, which brings lower performance. PMP (at L1) outperforms the original Bingo at the LLC by 16.5%. SPP+PPF leverages nine features for filtering the predictions generated by an aggressive Signature Path Prefetcher (SPP) [10], and performs much better than the original SPP. Though SPP+PPF applies more features compared to Bingo and PMP, its performance is lower, which indicates that more features are not necessarily better. In contrast to PMP, which can issue dozens of prefetches in one prediction, Pythia only generates one prefetch target per prediction, which limits its prefetch depth and performance. Pythia cannot prefetch deeply due to its poor prefetch accuracy, which is described in the next subsection.

On desktop or scientific workloads such as SPEC CPU 2006 and SPEC CPU 2017, the performance of PMP is better than that on the other workloads, since the memory access patterns of these workloads are regular. Though prefetchers usually require large hardware storage to deal with the irregular memory accesses generated by Ligra and PARSEC, PMP still outperforms the heavy prefetchers (Bingo, SPP+PPF and Pythia) at a lower cost on these traces.

C. Coverage & Accuracy in Single-core

Prefetch coverage and prefetch accuracy are two metrics that explain a lot about the performance of prefetchers. We define the coverage as the ratio of reduced load misses to the total load misses of the baseline, and the accuracy as the ratio of useful prefetches to the sum of useful and useless prefetches. PMP, SPP+PPF, and Bingo prefetch data into multiple levels of caches directly: PMP can fill data into L1D, L2C, and LLC; SPP+PPF and Bingo can fill data into
Figure 8: Single-core performance of five prefetchers.

Figure 10: Average useful and useless prefetches.
Figure 11: Main structure of Design B.

Table IX: Performance and Overhead of PMP Under Different Pattern Lengths

  Pattern Length   Region Size   Overhead   NIPC
  64               4KB           4.3KB      1.652
  32               2KB           2.5KB      1.626
  16               1KB           1.6KB      1.572
E. Performance Analysis in Single-core

1) Pattern Merging Strategy: In Section V-B, the comparison between PMP and DSPatch has indicated that it can be inefficient to merge patterns by an OR or an AND operation. Other than the merging strategy of PMP, it could be possible to obtain low storage consumption and good performance by merging only the identical patterns, considering the spatial and temporal locality of workloads. A bit vector attached with a counter can be used to count the repetitions of identical patterns, in which case the counter can be used for target pattern selection; e.g., the ANE scheme can be used to determine the prefetch pattern according to the counter. Instead of individually selecting each offset, all the valid offsets in a bit vector whose counter exceeds the threshold are the prefetch targets. The enhanced bit vectors can be stored in a set-associative cache indexed with trigger offsets. This design is named Design B, and its main structure is illustrated in Fig. 11. We compare Design B to the pattern merging strategy of PMP. Table VIII shows that the performance of Design B grows as the associativity increases. However, PMP outperforms Design B with 512 ways by 34.9%. Our pattern merging strategy is more efficient and eliminates the impact of a large number of evictions. Bingo does not suffer from severe evictions because it uses features with low PCRs.

2) Prefetch Pattern Extraction Schemes: We consider three schemes for the prefetch pattern extraction in Section IV-B. For the ANE, the Tl1d is 16 and the Tl2c is 5, to scale close to the default thresholds of the AFE. The thresholds are the same between the ARE and the AFE. Compared to the AFE, the ARE performs poorly, improving the baseline by only 5.0%. We find that the poor adaptability of the ARE for prefetching severely limits the prefetch coverage. Because stream patterns can hardly be extracted by the ARE, a large number of prefetch opportunities are lost. The performance of the ARE is very low on many traces that contain a large number of stream patterns. Furthermore, PMP obtains no performance improvements by tuning the thresholds of the ARE. Though methods similar to the ARE work well in many prefetchers [4], [10], [30], the ARE is not suitable for our prefetcher.

Because the ANE has the cold-start time problem, and the prefetching process could be interrupted by the halving mechanism, the performance of the ANE is slightly impacted. PMP using the ANE achieves a 60.3% improvement over the baseline, which is 2.9% lower than the AFE. The ANE can be another feasible scheme with reduced hardware complexity compared to the AFE.

3) Multi-Feature-based Prediction: We compare the performance of the dual pattern table structure to a single table indexed with the combination of PCs and trigger offsets. Because PMP uses 6-bit trigger offsets and 5-bit PCs, the number of entries storing patterns increases from 96 (2^6 + 2^5) to 2048 (2^(6+5)) when the combined feature is used. However, the performance degrades by 3.1% compared to the dual pattern table structure.

We also evaluate the performance of PMP with a single feature. The performance is reduced by 2.4% when using a single OPT. Moreover, we attempt to use a single PPT with the same size as the OPT, and the performance is reduced by 3.5%. We find that the single-feature-based PMP prefetches more data into the higher-level caches than the multi-feature-based PMP because of the lack of the prefetch level arbitration. As a result, more useless prefetches are produced. For example, PMP with the single PPT generates 38.3% more useless prefetches at L1D compared to the default PMP. The large number of useless prefetches at L1D hurts its performance.

4) Preset Parameters: Pattern Length. The length of counter vectors depends on the size of the tracked memory regions. Given a 4KB memory region, the accesses to each
Table XI: Performance of PMP Under Different Pattern Monitoring Ranges

  Monitoring Range   1       2       4       8
  NIPC               1.650   1.652   1.630   1.615
6.9% respectively. PMP can match the performance of Bingo on homogeneous and heterogeneous workloads. Because the aggressive prefetching strategy consumes a larger amount of bandwidth resources in the 4-core system, the performance improvement is slightly reduced compared to PMP in the single-core system. After limiting prefetching, PMP-Limit obtains 1% higher performance compared to Bingo.

VI. RELATED WORK

A. Prefetchers based on Constant Strides

Some memory access patterns can be described using constant strides. A stride is the difference between two continuous access addresses. This is a straightforward description of memory access patterns, which is widely used by prefetchers. For example, the Next Line Prefetcher (NL) [31] can be seen as a prefetcher that always prefetches data by a one-cacheline stride. Constant-stride-based prefetchers allow different strides for different memory regions [32]. The Best Offset Prefetcher (BOP) [3] periodically calculates the confidence of different strides and chooses the best stride for prefetching. The Sandbox Prefetcher (SP) [19] applies a similar method. The difference between BOP and SP is that the former records real memory accesses for calculating confidence while the latter uses a bloom filter to record fake prefetches instead. Though prefetchers built on constant strides can improve performance at very low storage costs, they can hardly predict complex patterns. Because constant strides only assume that future accesses arrive at a certain pace, the patterns that consist of variable strides, like the address sequence (P+1, P+2, P+4, P+3, P+1) in the memory region P, are beyond their capabilities.

B. Prefetchers based on Delta Sequences

Delta sequences use several deltas to describe the variation of strides in access sequences; e.g., the delta sequence for the address sequence (P+1, P+2, P+4, P+3, P+1) in the memory region P is (1, 2, -1, -2). Different from the other pattern forms, which require additional information for indexing, delta sequences of memory accesses can be utilized to index themselves, i.e., the prefixes of delta sequences can be used as a feature. For example, if the prior access history forms the delta sequence (1, 2, -1), we can predict that the next delta is -2 in the example above.
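As a toy illustration of this self-indexing property (the history length, table size, and hash below are arbitrary choices, not any particular prefetcher's design):

#include <stdint.h>

#define HIST 3      /* assumed: the last three deltas form the lookup key */
#define TABLE 256

static int8_t next_delta[TABLE];

static uint8_t key(const int8_t d[HIST]) {
    return (uint8_t)(d[0] * 31 + d[1] * 7 + d[2]);  /* toy hash, collisions ignored */
}

/* Learn "prefix of deltas -> next delta" mappings from an address trace. */
void train(const int64_t *addr, int n) {
    for (int i = HIST + 1; i < n; i++) {
        int8_t d[HIST];
        for (int j = 0; j < HIST; j++)
            d[j] = (int8_t)(addr[i - HIST + j] - addr[i - HIST + j - 1]);
        next_delta[key(d)] = (int8_t)(addr[i] - addr[i - 1]);
    }
}

int8_t predict(const int8_t d[HIST]) {
    return next_delta[key(d)];
}

/* After training on (P+1, P+2, P+4, P+3, P+1), the key (1, 2, -1)
 * maps to -2, reproducing the prediction in the text. */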
Signature Path Prefetcher (SPP) [10] captures patterns using five-delta sequences. The first four deltas of a sequence are compressed into a signature to trigger the last delta. Variable Length Delta Prefetcher (VLDP) [33] uses separate tables to maintain delta sequences with different lengths. Matryoshka [30] supports variable-length sequence matching by coalescing variable-length sequences into a single table.

Delta sequences can record the spatial and temporal information of memory access patterns. An advantage of recording memory access patterns through delta sequences is that the common parts of sequences can be shared, resulting in a reduction of data redundancy. Moreover, no additional features are required for indexing the patterns described through delta sequences. However, prefetchers relying on delta sequences are not accurate enough. To address this obstacle, Perceptron-based Prefetch Filtering (SPP+PPF) [4] designs a heavy perceptron-based filter with nine features to improve the accuracy of SPP. Moreover, prefetchers based on delta sequences have to yield prefetch addresses step by step using a recursive look-ahead strategy [4], [5], [10], [30]. It is difficult for prefetchers relying on delta sequences to issue dozens of prefetches immediately like Bingo does for deep prefetching.

C. Other Hardware Data Prefetchers

Some other prefetchers pay attention to irregular memory access patterns, such as the Global History Buffer (GHB) [34], the Irregular Stream Buffer (ISB) [35], the Managed Irregular Stream Buffer (MISB) [6], and Triage [8]. Irregular memory access patterns usually cross many pages and are difficult to describe through pattern forms such as constant strides, delta sequences, or bit vectors. GHB uses a large circular history buffer to record the access history. ISB reconstructs physical addresses into structural addresses and stores them in memory. MISB improves ISB by filtering unnecessary memory requests with bloom filters. Triage organizes patterns as key-value pairs of addresses and requires up to half the storage of an LLC to eliminate the requirement for off-chip storage. Most of them require too much storage and are only effective in irregular situations, which makes their designs unaffordable in general processors.

VII. CONCLUSION

In this paper, we design a low-overhead and powerful prefetcher, named Pattern Merging Prefetcher (PMP). By analyzing many features, we find that the Trigger Offset feature can be used to cluster memory access patterns with high similarity. By merging similar patterns, the storage overhead of our prefetcher is largely reduced, and an accurate prefetch pattern extraction strategy with various effective schemes is applied. Moreover, an optimized prediction mechanism based on multiple features is proposed. PMP outperforms the non-prefetching system by 65.2%, and exceeds the performance of the enhanced Bingo by 2.6% with 30× less storage overhead and Pythia by 8.2% with 6× less storage overhead in a single-core system.

We believe that there is still room for further exploration of the design idea of pattern merging. In the future, we plan to do further research on the fundamental reasons behind the three observations and try to apply the design idea to other predictors in processors for improving performance.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their feedback.
REFERENCES

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: implications of the obvious," SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, 1995. [Online]. Available: https://doi.org/10.1145/216585.216588

[2] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, "Bingo spatial data prefetcher," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 399–411.

[3] P. Michaud, "Best-offset hardware prefetching," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 469–480.

[5] P. Papaphilippou, P. H. J. Kelly, and W. Luk, "Pangloss: a novel markov chain prefetcher," CoRR, vol. abs/1906.00877, 2019. [Online]. Available: http://arxiv.org/abs/1906.00877

[6] H. Wu, K. Nathella, D. Sunwoo, A. Jain, and C. Lin, "Efficient metadata management for irregular data prefetching," in Proceedings of the 46th International Symposium on Computer Architecture (ISCA 2019), Phoenix, AZ, USA, June 22–26, 2019. ACM, 2019, pp. 449–461. [Online]. Available: https://doi.org/10.1145/3307650.3322225

[7] R. Bera, K. Kanellopoulos, A. Nori, T. Shahroodi, S. Subramoney, and O. Mutlu, "Pythia: A customizable hardware prefetching framework using online reinforcement learning," in MICRO 2021. New York, NY, USA: Association for Computing Machinery, 2021, pp. 1121–1137. [Online]. Available: https://doi.org/10.1145/3466752.3480114

[8] H. Wu, K. Nathella, J. Pusdesris, D. Sunwoo, A. Jain, and C. Lin, "Temporal prefetching without the off-chip metadata," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2019), Columbus, OH, USA, October 12–16, 2019. ACM, 2019, pp. 996–1008. [Online]. Available: https://doi.org/10.1145/3352460.3358300

[9] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, "Spatial memory streaming," in Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA '06). IEEE Computer Society, 2006, pp. 252–263. [Online]. Available: https://doi.org/10.1109/ISCA.2006.38

[10] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti, "Path confidence based lookahead prefetching," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.

[11] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, "Accurately and maximally prefetching spatial data access patterns with bingo," 2019. [Online]. Available: https://dpc3.compas.cs.stonybrook.edu/pdfs/Accurately.pdf

[12] S. Volos, J. Picorel, B. Falsafi, and B. Grot, "BuMP: Bulk memory access prediction and streaming," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 545–557.

[13] R. Bera, A. V. Nori, O. Mutlu, and S. Subramoney, "DSPatch: Dual spatial pattern prefetcher," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52). New York, NY, USA: Association for Computing Machinery, 2019, pp. 531–544. [Online]. Available: https://doi.org/10.1145/3352460.3358325

[14] M. Sutherland, A. Kannan, and N. Enright Jerger, "Not quite my temp: Matching prefetches to memory access times," in Data Prefetching Championship Workshop, 2015.

[17] "PARSEC." [Online]. Available: http://parsec.cs.princeton.edu/

[18] J. Shun and G. E. Blelloch, "Ligra: A lightweight graph processing framework for shared memory," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13). New York, NY, USA: Association for Computing Machinery, 2013, pp. 135–146. [Online]. Available: https://doi.org/10.1145/2442516.2442530

[19] S. H. Pugsley, Z. Chishti, C. Wilkerson, P. Chuang, R. L. Scott, A. Jaleel, S. Lu, K. Chow, and R. Balasubramonian, "Sandbox prefetching: Safe run-time evaluation of aggressive prefetchers," in 20th IEEE International Symposium on High Performance Computer Architecture (HPCA 2014), Orlando, FL, USA, February 15–19, 2014. IEEE Computer Society, 2014, pp. 626–637. [Online]. Available: https://doi.org/10.1109/HPCA.2014.6835971

[20] "MCF homepage." [Online]. Available: https://www.zib.de/opt-long_projects/Software/Mcf/

[21] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," HP Laboratories, vol. 27, p. 28, 2009.

[22] "ChampSim simulator." [Online]. Available: https://github.com/ChampSim/ChampSim

[23] E. Bhatia, G. Chacon, E. Teran, P. V. Gratz, and D. A. Jiménez, "Enhancing signature path prefetching with perceptron prefetch filtering," 2019. [Online]. Available: https://dpc3.compas.cs.stonybrook.edu/pdfs/Enhancing_signature.pdf

[24] "The 2nd data prefetching championship," 2015. [Online]. Available: http://comparch-conf.gatech.edu/dpc2/

[25] "The 3rd data prefetching championship," 2019. [Online]. Available: https://dpc3.compas.cs.stonybrook.edu/

[26] "CloudSuite traces." [Online]. Available: https://www.dropbox.com/sh/pgmnzfr3hurlutq/AACciuebRwSAOzhJkmj5SEXBa/CRC2_trace?dl=0&subfolder_nav_tracking=1

[27] "JEDEC-DDR4." [Online]. Available: https://www.jedec.org/sites/default/files/docs/JESD79-4.pdf