
Improving the Performance of Hybrid Caches Using

Partitioned Victim Caching

SUKARN AGARWAL and HEMANGEE K. KAPOOR, Indian Institute of Technology Guwahati

Non-Volatile Memory technologies are emerging as a viable option on account of their high density and low
leakage power compared to the conventional SRAM counterpart. However, their increased write latency reduces
their suitability as a substitute for SRAM. To attenuate this problem, hybrid STT-RAM-SRAM architectures have
been proposed, where a small number of SRAM ways is incorporated alongside a large number of STT-RAM ways to
handle the write operations. However, the performance gain obtained from such an architecture is not as much
as expected, on account of the larger miss rate caused by the smaller SRAM partition. This, in turn, limits
the effective cache capacity.
This article attempts to reduce the miss penalty and improve the average memory access time by retaining
the victims evicted from the hybrid cache in a smaller, fully associative SRAM structure called the victim
cache. The victim cache is accessed on a miss in the primary hybrid cache. Hits in the victim cache require an
exchange of the block between the main hybrid cache and the victim cache. In such cases, to effectively place
the required block in the appropriate region of the main hybrid cache, we propose an access-based block
placement technique. In addition, to manage the runtime load and the uneven evictions from the SRAM partition,
we also present a dynamic region-based victim cache partitioning method that holds the victims dedicated to
each region. Experimental evaluation on a full system simulator shows significant improvement in performance
and execution time, along with a reduction in the overall miss rate. The proposed policy also increases
the endurance of Hybrid Cache Architectures (HCA) by reducing writes in the STT partition.
CCS Concepts: • Hardware → Memory and dense storage; • Computer systems organization → Mul-
ticore architectures;
Additional Key Words and Phrases: Hybrid cache, victim cache, non-volatile, STT-RAM, partitioning
ACM Reference format:
Sukarn Agarwal and Hemangee K. Kapoor. 2020. Improving the Performance of Hybrid Caches Using Parti-
tioned Victim Caching. ACM Trans. Embed. Comput. Syst. 20, 1, Article 5 (December 2020), 27 pages.
https://doi.org/10.1145/3411368

1 INTRODUCTION
With recent advancements in CMOS technology, more and more cores are integrated on
the same die, leading to multi-core embedded systems. The major requirements of embedded
systems are a high performance-to-power ratio and a small area footprint with large memory
capacity. Traditional caches made up of SRAM fall short of fulfilling these requirements
due to their large static power consumption, low density, and poor scalability. In other words, for

Authors’ addresses: S. Agarwal and H. K. Kapoor, Department of Computer Science and Engineering, Indian Institute of
Technology Guwahati, Assam, India; emails: {sukarn, hemangee}@iitg.ac.in.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2020 Association for Computing Machinery.
1539-9087/2020/12-ART5 $15.00
https://doi.org/10.1145/3411368

ACM Transactions on Embedded Computing Systems, Vol. 20, No. 1, Article 5. Publication date: December 2020.

Table 1. Percentage Increase in Miss Rate by RWHCA against Baseline STT-RAM

Cache    PARSEC v2.1                                         SPEC CPU 2006                  MIBench
Config.  Cann   Ded    Fluid  Freq    Stream  X264   SMix1  SMix2  SMix3  SMix4   MMix1   MMix2  MMix3  MMix4
1MB      5.46%  15.3%  8.65%  18.3%   3%      6.8%   8.74%  7.82%  5.30%  12.30%  10.14%  18.3%  8.72%  10.32%
2MB      3.17%  11.7%  4.77%  16.25%  2.42%   5.36%  7.80%  6.23%  4.51%  10.32%  9.70%   18%    4.51%  3.15%
4MB      2.32%  10.8%  4.66%  15.7%   2.3%    4.46%  7.37%  5.40%  3.90%  8.51%   7.8%    6.23%  3%     0.75%
8MB      2.69%  10.3%  4.55%  14%     2.04%   3.27%  6.35%  4.01%  3.13%  5.60%   6.4%    2.52%  1.43%  0.1%

small-sized embedded systems, in the given area footprint, conventional SRAM caches are not
able to meet the capacity requirements and also suffer from high static power consumption,
which increases the power density. Recently, developments in emerging Non-Volatile Memory
(NVM) technologies have drawn the attention of architects and researchers to building the memory
hierarchy using these technologies. These technologies include Resistive Random
Access Memory (ReRAM) [17], Spin-Transfer Torque Random Access Memory (STT-RAM) [6],
and Phase-Change Random Access Memory (PCRAM) [26]. The advantages of including these
memory technologies in the hierarchy are low static power consumption, high density, and
good scalability. Despite these advantages, their main drawbacks are weak write endurance and
costly write operations (in latency as well as energy). Among these memory
technologies, the write energy of PCRAM and the write endurance of ReRAM are the worst, which
makes STT-RAM the viable candidate for the Last Level Cache (LLC). However, the asymmetric
read and write latency and energy reduce the possibility of STT-RAM completely replacing its SRAM
counterpart.
To overcome the drawbacks of STT-RAM, researchers have introduced the concept of Hybrid
Cache Architecture (HCA) [2, 31] where the cache bank consists of a large portion of NVM ways
and a small portion of SRAM ways. The small SRAM ways handle most of the write
accesses of the bank, thereby avoiding costly NVM writes, which incur extra latency
and energy. However, by partitioning the LLC into two regions and using the small SRAM
region for write-intensive workloads, the performance benefits obtained from the hybrid cache
are not as much as expected. This is due to the increase in miss rate on account of the shorter residency
of write-intensive blocks in the limited-size SRAM region. Table 1 presents the increase in
the miss rate for one of the existing hybrid cache policies, Read Write Aware Hybrid Cache Ar-
chitecture (RWHCA) [31], over the baseline pure STT-RAM cache that uses LRU as the replacement
policy, for different types of multi-programmed, multi-threaded, and embedded workloads (details
of the experimental setup are reported in Section 5.1). The conclusion that can be derived from
the table is that as the cache size becomes larger, the residency of the blocks increases, which in
turn improves the miss rate. At the same time, due to the increased miss rate, smaller caches
(which are generally used in embedded systems) suffer in performance. To counter this, in
this article, we propose a technique to improve the performance of hybrid cache by incorporating
a small, fully associative SRAM-based Victim Cache (VC) [14]. To the best of our knowledge, this is the
first work that exploits a victim cache with an NVM-based hybrid cache. Blocks evicted
from the hybrid cache are stored in the victim cache. During normal cache access, upon a miss in the
hybrid cache, the victim cache is searched. If the required block is found in the victim cache, it
has to be moved to the hybrid cache. In this case, we propose a policy that intelligently places
blocks from the victim cache into an appropriate region of the hybrid cache depending on the type of
request. We also propose to partition the victim cache dynamically into STT and SRAM regions to

Fig. 1. Overview of STT-RAM cell: (a) Conceptual view of STT-RAM cell (b) Schematic STT view with
(1) Write "1" operation (2) Write "0" and Read operation (c) Parallel low resistance, representing "0" state
(d) Anti-parallel high resistance, representing "1" state.

balance the uneven block evictions from the different regions of the hybrid cache according to the runtime
load. In this article, we use STT-RAM as the non-volatile region of the hybrid cache, although the
proposed ideas can easily be extended to other NVM technologies such as PCRAM and ReRAM.
The main contributions of this article are as follows:

• Upon a hit in the victim cache, we propose a technique to intelligently place the required
block in the appropriate region of the hybrid cache.
• Another technique is proposed that partitions the victim cache dynamically into two
variable-sized regions to balance the uneven evictions from the different regions of the
hybrid cache.
• Experimental evaluation on a full system simulator GEM-5 [8] shows significant perfor-
mance improvement along with the savings in the execution time over the existing tech-
niques and the baselines.
• We also present a detailed analysis of different configurations of main hybrid cache and
victim cache along with different parameters of the proposed technique.

The rest of the article is organized as follows: Background and motivation are reported in Sec-
tion 2. Related works are discussed in Section 3. Section 4 presents the proposed techniques for the
victim cache. Section 5 discusses the experimental evaluation. Results and analysis are reported
in Section 6. Comparative analysis with different configurations and parameters is presented
in Section 7. We conclude this article in Section 9.

2 BACKGROUND AND MOTIVATION


2.1 STT-RAM
Figure 1 shows a representational view of the STT-RAM cell [6]. The STT-RAM cell is made up
of a Magnetic Tunnel Junction (MTJ) and an access transistor. The MTJ consists of
two ferromagnetic layers, viz. a reference (fixed) layer and a free layer, separated by a thin insulating tunnel
barrier made of MgO. Note that in the MTJ, the magnetization direction of the reference layer
is fixed, while the magnetization direction of the free layer is changed according to the spin-polarized
current. The bit stored in the cell is represented by the relative magnetization direction of these two
layers. In other words, parallel magnetization represents the low resistance and "0" state of the


Table 2. Percentage Times the Block Placed from Victim Cache to Different Regions of Hybrid Cache

                PARSEC v2.1                                        SPEC CPU 2006               MIBench
Workloads       Cann   Ded    Fluid  Freq   Stream  X264   SMix1  SMix2  SMix3  SMix4  MMix1  MMix2  MMix3  MMix4  Mean
STT Placement   96.1%  99.8%  89.7%  60.3%  98.1%   53.6%  60.6%  89.3%  80.8%  97.7%  67.9%  78.3%  78.8%  76.9%  80.6%
SRAM Placement  3.9%   0.2%   10.3%  39.7%  1.9%    46.4%  39.2%  10.7%  19.2%  2.3%   32.1%  21.7%  21.2%  23.1%  19.4%

STT cell (Figure 1(c)), whereas the anti-parallel magnetization direction represents the "1" state
and the high resistance (Figure 1(d)).
The write "0" and "1" operations in the STT cell are performed by establishing a large positive
(Figure 1(b)(2)) and negative (Figure 1(b)(1)) voltage difference between the source line and bit line,
respectively. On the other hand, the read operation is performed by applying a small voltage
between the source line and the bit line, which generates a current that is compared with a
reference current to detect the state of the STT cell (Figure 1(b)(2)).

2.2 Victim Cache


The victim cache, proposed by Jouppi [14], improves the performance of an SRAM-based
main cache by retaining the victims evicted from the main cache. Usually, the victim cache is an
SRAM-based fully associative structure, and it can be associated with any level of cache in a multi-
level cache hierarchy. When a block is evicted from the main cache, it is retained in the victim
cache by substituting the LRU block of the victim cache. When a block request (R) is received
from the upper-level cache, the requested block is searched in both the main cache and the victim
cache in parallel. If the block is found in the victim cache, the requested block is first placed into
the main cache, and thereafter the request (R) is served. If there is no invalid entry in the
main cache, the LRU block of the cache set in the main cache is swapped with the requested
block in the victim cache.
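The lookup-and-retain behaviour described above can be sketched as a minimal Python model (class and method names are our own, for illustration only):

```python
from collections import OrderedDict

class VictimCache:
    """Minimal fully associative victim cache with LRU replacement."""
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.blocks = OrderedDict()  # tag -> data; insertion order tracks LRU

    def lookup(self, tag):
        """Return the block on a hit (refreshing its recency), else None."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # mark as most recently used
            return self.blocks[tag]
        return None

    def insert(self, tag, data):
        """Retain a victim evicted from the main cache, displacing the LRU entry."""
        if len(self.blocks) >= self.num_entries:
            self.blocks.popitem(last=False)  # evict the LRU victim
        self.blocks[tag] = data

# A main-cache miss that hits in the victim cache triggers a swap of the
# requested block with the LRU block of the corresponding main-cache set:
vc = VictimCache(num_entries=4)
vc.insert(0x1A, "victim block")
assert vc.lookup(0x1A) == "victim block"   # hit in victim cache
assert vc.lookup(0x2B) is None             # miss falls through to the next level
```

The model omits the tag/data split and the swap with the main cache; it only captures the retention and LRU displacement behaviour of Jouppi's structure.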

2.3 Motivation
The initial block placement policy that we consider for the hybrid cache is the same as that of
Read Write Aware Hybrid Cache Architecture (RWHCA) [31]. In particular, we place all blocks that arrive
on a load miss in the STT region and all blocks that arrive on a store miss in the SRAM region.
Employing a victim cache with a hybrid cache requires a good victim retention policy, as well as a
policy to place blocks in the appropriate region of the hybrid cache when they are moved back from
the victim cache. Such a policy is needed because, upon a hit in the victim cache, a write-intensive
victim block may be placed in the STT region of the hybrid cache: this depends on the replacement
policy of the hybrid cache, which may select the LRU block from the STT region. Such a placement
incurs extra writes to STT, which may degrade performance as well as increase energy consumption,
and can overcome the benefits offered by the victim cache associated with the main hybrid cache.
Table 2 shows the percentage of times a block is placed in the STT region of the main hybrid cache
upon a hit in the victim cache (for different workloads). The conclusion that can be derived from the
table is that, on average, 80.6% of the time the block is placed in the STT region. This is because
there is a large possibility of LRU victim selection from the STT region, as the cache set contains
three-fourths STT ways compared to one-fourth SRAM ways. This motivates us to propose an effective
and intelligent policy for placing blocks in the appropriate regions of the hybrid cache upon a hit
in the victim cache.


3 RELATED WORKS
This section briefly describes the different state-of-the-art block placement, reconfiguration, by-
passing, wear-leveling, and multi-retention policies in the hybrid (or NVM) cache architecture that
improves performance and energy. In addition, the policies that exploit the victim cache architec-
tures with pure SRAM-based cache are also discussed.
Read Write Aware Hybrid Cache Architecture (RWHCA), proposed by Wu et al. [31], places blocks in
the different regions of the hybrid cache according to the type of miss. In particular, all blocks
that arrive due to a load miss are placed in the STT region, and all blocks that arrive due to a
store miss are placed in the SRAM region. Any disproportion in the accesses to either region leads
to the migration of blocks from one region to another. A technique called Write Intensity, proposed
by Ahn et al. [4], predicts, based on write intensity, in which region of the hybrid cache a block
loaded on a load miss should be placed. The Adaptive Placement and Migration (APM)
technique, which categorizes the writes into three categories: Prefetch write, Demand Write, and
Core Write, is proposed in Reference [30]. Here, based on the prediction outcomes, the decision
to either place the block to different regions of the hybrid cache or to bypass the block is taken. A
reuse distance mechanism that decides to bypass the block for an exclusive cache is proposed in
Kim et al. [16]. Other than APM, various bypass approaches that bypass a group of blocks around a
certain cache level have been proposed in References [5, 28]. Partition schemes that consider
unbalanced writes and the wear-out issue of STT-RAM and accordingly place blocks in the different
regions are presented in References [20, 21]. Different reconfiguration
techniques that change the configuration of the hybrid cache at runtime are
presented in References [9, 25]. To improve the lifetime, different wear leveling approaches at the
cache bank level are presented in References [1, 3]. Besides these proposals, variable-retention
schemes and multi-retention hybrid or NVM-based caches have been proposed in recent years to
improve cache performance. A dynamic refresh scheme for a multi-retention STT cache that reduces
the costly latency and energy is proposed in Reference [29]. Kuan et al. [18] presented a dynamic,
adaptable retention scheme for an STT-RAM-based L1 cache. Here, based on the runtime EDP and miss
rate, applications are mapped to different retention partitions of the cache. Another policy that
forms dynamic cache clusters with variable retention times for the LLC is proposed in
Reference [19]. Here, the mapping of an application to a cluster is based on the access latency.
All these multi-retention schemes require a dynamic or static refresh, which incurs extra latency
and energy. Moreover, if a block is not refreshed at the appropriate time, the miss rate may
increase. This impacts the performance gain obtained by the multi-retention scheme and increases
the accesses to the next level of memory. A thrashing-aware block placement approach that places dirty thrashing
blocks in SRAM and clean thrashing blocks in STT is proposed in Reference [22]. A linefill-based
block placement approach that takes into account the allocation count, miss rate, and NVM write
count during victim selection is proposed in Reference [10]. Among the above techniques, we
compare our proposed technique with the following state-of-the-art: RWHCA, Write
Intensity, and Adaptive Placement and Migration.
Stiliadis et al. [27] proposed a technique called selective victim caching for a pure SRAM-based
cache. The technique decides whether to place a block in the main cache or the victim cache
based on a predictor. The predictor maintains metadata for the cache blocks based on their
past history of use. Recently, Nath et al. [24] presented the use of an optimized victim
cache with SRAM-based LLC to improve the performance of the NVM-based hybrid main memory.
A virtual victim cache presented by Khan et al. [15] predicts whether to retain the victim block in
the other set of the cache bank. Here, with each cache set in a cache bank, one partner cache set is


Fig. 2. Schematic view of victim cache architecture associated with hybrid cache.

associated, which differs by one bit in the set-id position. A comparative study proposed by Zhang
et al. [32] takes into account the performance impact of incorporating the equal-sized victim buffer
with an exclusive cache and the victim cache with an inclusive cache. All the techniques mentioned
above are proposed with an SRAM-based main cache and do not take into account the asymmetric
behavior of hybrid (NVM-based) caches. To the best of our knowledge, this is the first work that
exploits the possibilities of a victim cache with an NVM-based hybrid cache.

4 PROPOSED VICTIM CACHE ARCHITECTURE


In this work, our main aim is to improve the performance of the hybrid cache using a Victim Cache
(VC) while maintaining the principles of the hybrid cache that control the writes in the STT
region. The supporting victim cache must obey this principle; therefore, we propose a policy,
called AVBP, that decides the target partition in the hybrid cache when a block is moved back to
the hybrid cache from the victim cache.
The purpose of the victim cache is to retain the most recent victims from the hybrid cache. However,
due to the uneven partition sizes of the hybrid cache, evictions may come more often from the
smaller SRAM partition, depending on application behavior. If the SRAM partition of the hybrid
cache produces more evictions over an interval, these evicted blocks, when moved to the victim
cache, will displace the blocks in the victim cache belonging to the STT partition of the hybrid
cache. To maintain a balanced mix of victims coming from the individual partitions of the hybrid
cache, we propose to partition the victim cache to hold blocks coming from each region. Depending
on the increase or decrease in the number of evictions from SRAM, the partition sizes in the victim
cache are adjusted at runtime so that victims from STT get judicious space in the victim cache.
This policy is called RDVCP.

4.1 Architecture
Figure 2 shows the schematic view of the victim cache architecture associated with the main hybrid
last-level cache. As shown in the figure, the hybrid cache is made up of a large number of STT ways
and a small number of SRAM ways. The tag array (made up of SRAM) of the main hybrid cache contains
the tag and the MetaData (MD) information: state information (to maintain the coherence of the LLC
block), a valid bit, and two dirty bits for each block in the data array. Note that both dirty bits
are updated (or set) when the block is written back from the upper-level cache; the usage of these
dirty bits with respect to the victim cache and the coherence protocol is elaborated later (refer to
Section 4.2, AVBP). The victim cache is a small SRAM-based fully associative structure that consists
of both a tag and a data array, like a normal cache. The tag array of the victim cache contains the
tag and the MetaData (MD) information (a valid bit and a dirty bit) for each data block entry of the
victim cache. Note that when a block is evicted from the main hybrid cache and placed into the
victim cache at a location, say T, the dirty bit at location T is updated with the value of the
second dirty bit of the main cache. With each data entry in the data array of the victim cache, a
single bit, r_bit, is associated. The r_bit identifies the region of the main hybrid cache from
which the block was evicted. Note that the entries of the victim cache need not maintain coherence
information, as the blocks in the victim cache are blocks evicted from the main cache.
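To make the metadata layout concrete, the per-block state described above could be modelled as follows (a hypothetical Python sketch; all field names except r_bit are our own, and the paper does not prescribe any particular encoding):

```python
from dataclasses import dataclass

@dataclass
class HybridTagEntry:
    """Tag-array entry of the main hybrid cache (tag array is SRAM)."""
    tag: int
    coherence_state: str   # coherence state of the LLC block (e.g. "M", "S")
    valid: bool
    dirty_coherence: bool  # first dirty bit: follows the coherence state
    dirty_sticky: bool     # second dirty bit: set on upper-level writeback and
                           # left intact by a read-induced transition to shared

@dataclass
class VictimTagEntry:
    """Tag-array entry of the victim cache (no coherence state is needed)."""
    tag: int
    valid: bool
    dirty: bool  # copied from dirty_sticky of the block evicted from the main cache
    r_bit: int   # region of the hybrid cache the block was evicted from (0/1)

# On eviction from the hybrid cache, the victim entry inherits the second
# (sticky) dirty bit, even if the coherence-visible dirty bit was cleared:
evicted = HybridTagEntry(tag=0x3C, coherence_state="S", valid=True,
                         dirty_coherence=False, dirty_sticky=True)
victim = VictimTagEntry(tag=evicted.tag, valid=True,
                        dirty=evicted.dirty_sticky, r_bit=1)
assert victim.dirty is True
```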
ALGORITHM 1: AVBP
1: ULC: Upper Level Cache.
2: LLC: Last Level Cache.
3: Vn: Total number of entries in victim cache.
4: repeat
5:   for every request R coming from ULC to LLC that misses in the hybrid cache and hits in the victim cache do
6:     Let the requested block B be found in the victim cache at position m.
7:     if R == ReadHit then
8:       if B.dirty == 1 then
9:         Swap B with the LRU of the SRAM region of the hybrid cache.
10:      else
11:        Swap B with the LRU of the STT region of the hybrid cache.
12:      end if
13:    else
14:      Swap B with the LRU of the SRAM region of the hybrid cache.    ▷ WriteHit
15:    end if
16:  end for
17: until the end of the execution

4.2 Access-based Victim Block Placement (AVBP)


This section elaborates our proposed Access-based Victim Block Placement (AVBP) technique, which
places a block effectively and intelligently in the appropriate region of the hybrid cache when it
is found in the victim cache.
Operation: The proposed technique is elaborated in Algorithm 1. The algorithm covers the case when
the block is found in the victim cache and is to be placed into the appropriate region of the main
hybrid last-level cache. For ease of explanation, we consider a fully associative victim cache with
Vn entries (line 3). For each request coming from the Upper-Level Cache (ULC) to the Last Level
Cache (LLC), a tag lookup is performed in both the main hybrid cache and the victim cache. Upon a
hit in the main hybrid cache, the requested block is served as in a normal cache. However, when the
block is found in the victim cache at position m (line 6), then according to the type of request,
the block is swapped with the appropriate region of the main hybrid cache as described below:
• Read Hit: In this case, the dirty bit of the requested block B at position m in the victim
cache is examined. If the requested block is found to be dirty, block B is swapped (or
moved, if there is an invalid entry in the SRAM region of the hybrid cache) with the LRU block
of the SRAM region in the main hybrid cache. The reason for putting the block in the
SRAM partition is the multiple prospective future write requests for block B,
as it is already dirty (lines 8 and 9). However, if the requested block is not dirty, the
block is swapped with the LRU block in the STT region of the main hybrid cache (lines 10
and 11).


Table 3. Percentage Time SRAM Evictions (SRE) Greater than STT Evictions (STE)

            PARSEC v2.1                                       SPEC CPU 2006               MIBench
Workloads   Cann   Ded    Fluid  Freq   Stream  X264   SMix1  SMix2  SMix3  SMix4  MMix1  MMix2  MMix3  MMix4  Mean
SRE > STE   99.4%  39.6%  19.6%  22.8%  4.31%   32.1%  19.3%  18.5%  13.2%  23.1%  10.7%  17%    20.6%  6.93%  24.8%

• Write Hit: In this scenario, the block is swapped with the LRU victim block of the SRAM
region due to multiple prospective future write requests (lines 13 to 15).
• Hybrid Cache Miss and Victim Cache Miss: If the block is found in neither the main
hybrid cache nor the victim cache, it is fetched from the main memory. As per the
placement policy of the main cache, the fetched block is placed in the particular region, and
the LRU block is evicted from that region. This LRU block is kept in the victim cache. Note that, in
this case, to make room for the evicted block of the hybrid cache, the LRU block of the victim
cache is evicted.
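The three cases above reduce to a small placement decision, sketched here in Python (the function name and string encodings are ours, not the paper's):

```python
def avbp_target_region(request_type, block_dirty):
    """Access-based Victim Block Placement (Algorithm 1, AVBP).

    Decide which region of the hybrid cache receives a block that hit in the
    victim cache. Dirty blocks and write hits are expected to see further
    writes, so they go to SRAM; clean read hits go to the STT region.
    """
    if request_type == "read":
        return "SRAM" if block_dirty else "STT"
    return "SRAM"  # write hit: prospective future writes favor SRAM

assert avbp_target_region("read", block_dirty=True) == "SRAM"   # dirty read hit
assert avbp_target_region("read", block_dirty=False) == "STT"   # clean read hit
assert avbp_target_region("write", block_dirty=False) == "SRAM" # write hit
```

In hardware this decision amounts to examining the request type and the victim cache's dirty bit before choosing the LRU victim for the swap.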
Note that, in most traditional cache coherence protocols, when there is a read request for
a dirty block, the block is written back to main memory and transitioned to a shared state (dirty
bit reset to 0). In this case, the first dirty bit of the written-back block changes as per the
state of the block (0 for the shared state and 1 for the modified state), while the second dirty
bit of the block remains intact. Further, at the time of block eviction from the hybrid cache, the
status of the second dirty bit is examined, and it updates the dirty bit of the block in the victim
cache.
Limitation: The limitation of AVBP is the larger number of evictions from the SRAM region on
account of the placement policy of the hybrid cache and the access behavior of the running
applications. In other words, the behavior of applications running on multiple cores is not
constant and changes over time. By experimental analysis, we found that, within an interval of 2M
cycles, the evictions from the SRAM region are sometimes greater. Table 3 presents this evidence as
the percentage of times the SRAM region evictions are greater than the STT region evictions, over
2M-cycle intervals for the whole execution. As reported in the table, on average 24.8% of the time,
the evictions from the SRAM region are greater than from the STT region, and these evicted SRAM
blocks will replace the STT blocks of the victim cache. This can cause the victims evicted from the
STT-RAM to not stay long in the victim cache and to be prematurely evicted from it. This motivates
us to propose a dynamic region-based victim cache partitioning technique to control this uneven
SRAM eviction behavior and keep judicious space in the victim cache for STT blocks.

4.3 Region-based Dynamic Victim Cache Partitioning (RDVCP)


Main Idea: The key idea of RDVCP is to partition the victim cache dynamically and intelligently
into two regions: one for SRAM victims and one for STT victims. A victim evicted from the main
hybrid cache is placed appropriately in one of these victim regions. The sizes of the SRAM victim
region and the STT victim region are adjusted at runtime depending on the application pattern. Note
that the dynamic decision to alter the sizes is taken after every predefined interval I.
Operation: We explain the operation of RDVCP through Algorithm 2. The algorithm describes the
interval-wise Region-based Dynamic Victim Cache Partitioning and the placement of evicted blocks
from the main hybrid cache into the different regions of the victim cache. Note that while placing
the victims, the algorithm maintains the allocated partition sizes for the current interval. The
parameter Vn has the same role as in Algorithm 1 (line 2 of Algorithm 2). The tunable parameter I
is the predefined interval for deciding the dynamic victim cache partitions (line 3). The threshold
used for altering the allocated partition sizes of the victim

ALGORITHM 2: RDVCP
1: HCA: Hybrid Cache Architecture.
2: Vn: Total number of entries in victim cache.
3: I: Predefined interval.
4: Bias: Threshold to change the partition size of victim cache.
5: Curr_SRAM_Evict: Eviction counter that records the number of evictions from SRAM in the current interval.
6: Prev_SRAM_Evict: Eviction counter that maintains the eviction count from the previous interval.
7: vic_STT_ways: Number of ways allocated in the victim cache for the STT region.
8: vic_SRAM_ways: Number of ways allocated in the victim cache for the SRAM region.
9: vic_STT_ways = Vn/2; vic_SRAM_ways = Vn/2
10: max_SRAM_vic = 3Vn/4; min_SRAM_vic = Vn/2
11: Run the application for I cycles treating the whole cache as a normal cache, storing the victims evicted from each region of the hybrid cache in the respective partition of the victim cache.
12: repeat
13:   for each interval I, at the end of the interval, do
14:     Let δ = Curr_SRAM_Evict − Prev_SRAM_Evict
15:     Let δ′ = Prev_SRAM_Evict − Curr_SRAM_Evict
16:     x = vic_SRAM_ways + Vn/4; x′ = vic_SRAM_ways − Vn/4
17:     y = vic_SRAM_ways + Vn/8; y′ = vic_SRAM_ways − Vn/8
18:     if δ ≥ 2 ∗ Bias then
19:       if x ≤ max_SRAM_vic then
20:         vic_SRAM_ways = x
21:       else if y ≤ max_SRAM_vic then
22:         vic_SRAM_ways = y
23:       end if
24:     else if Bias ≤ δ < 2 ∗ Bias then
25:       if y ≤ max_SRAM_vic then
26:         vic_SRAM_ways = y
27:       end if
28:     else if δ′ ≥ 2 ∗ Bias then
29:       if x′ ≥ min_SRAM_vic then
30:         vic_SRAM_ways = x′
31:       else if y′ ≥ min_SRAM_vic then
32:         vic_SRAM_ways = y′
33:       end if
34:     else if Bias ≤ δ′ < 2 ∗ Bias then
35:       if y′ ≥ min_SRAM_vic then
36:         vic_SRAM_ways = y′
37:       end if
38:     end if
39:     vic_STT_ways = Vn − vic_SRAM_ways
40:     IRVP(vic_STT_ways, vic_SRAM_ways)
41:   end for
42: until the end of the execution

Interval-wise Region-based Victim Placement
43: function IRVP(new_STT_ways, new_SRAM_ways)
44:   exist_STT_ways: Existing allocation of STT victim counts.
45:   exist_SRAM_ways: Existing allocation of SRAM victim counts.
46:   for every eviction of block B in HCA do
47:     Let the block B be evicted from region P of the hybrid cache.
48:     if new_P_ways ≤ exist_P_ways then
49:       Place B in its respective region P of the victim cache.
50:     else
51:       Place B in the other region P′ of the victim cache.
52:       exist_P_ways − −; exist_P′_ways + +
53:       Update r_bit for B.
54:     end if
55:   end for
56: end function

cache is represented by the parameter Bias (line 4), evaluated at the end of each interval. (We chose
the value of Bias by empirical analysis, conducting extensive profiling over different candidate
values, as reported in Section 7.3.) The count of the total number of evictions from the SRAM
region of the main hybrid cache in the current interval and in the previous interval is represented
by the variables Curr_SRAM_Evict and Prev_SRAM_Evict,

5:10 S. Agarwal and H. K. Kapoor

respectively (lines 5 and 6). The variables vic_STT_ways and vic_SRAM_ways are used to maintain
the count of the number of victim ways allocated to the STT-RAM victim region and the SRAM
victim region, respectively, for the current interval (lines 7 and 8). Initially, at the beginning of
execution, half of the ways of the victim cache are allocated to each region (line 9). For a block in
the victim cache, to identify which region of the main hybrid cache it came from, we use a bit
called r_bit: if it is set, the block came from the STT region; otherwise, from the SRAM region. We
set the maximum and minimum (line 10) sizes allowed for each victim region as follows:
• max_SRAM_vic = 3Vn/4, min_SRAM_vic = Vn/2
• max_STT_vic = Vn/2, min_STT_vic = Vn/4
For the initial I cycles of application execution, the victim cache behaves normally, with each
block evicted from the main hybrid cache stored in the respective region of the victim cache (line
11). Once the application crosses the first I cycles, i.e., at the end of each interval, different
operations are performed according to the uneven evictions from the SRAM region of the main
hybrid cache:
• The SRAM evictions in the current interval are greater than in the previous interval:
In this case, depending on the difference (δ) between the current-interval SRAM evictions
and the previous-interval SRAM evictions, an appropriate increase of the SRAM victim
region is performed. Note that an increment of the SRAM partition results in a corresponding
decrease of the STT partition of the victim cache (line 39). Specifically, if δ ≥ 2 ∗ Bias, we
increase the SRAM part by at most Vn/4 (lines 18 to 20). If this increase would violate the
maximum size constraint, an increase of Vn/8 is attempted instead (lines 21 to 23). If the
partition size is already at the maximum limit, no change in size is performed.
However, if Bias ≤ δ < 2 ∗ Bias, an increase of Vn/8 is applied to the size
of the SRAM victim region (lines 24 to 26).
• The SRAM evictions in the current interval are less than in the previous interval: Using
the same logic as in the above case, a decrease of the SRAM victim region size is performed
while respecting the minimum size constraint (lines 28 to 38).
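The interval-end resizing rule above can be captured in a few lines. The following Python sketch is illustrative only; the names mirror Algorithm 2, but this is not the authors' hardware implementation:

```python
def resize_sram_partition(curr_evict, prev_evict, sram_ways, Vn, bias):
    """Return (SRAM ways, STT ways) at an interval boundary.

    Mirrors lines 14-38 of Algorithm 2: grow the SRAM victim region when
    SRAM evictions rose, shrink it when they fell, stepping by Vn/4 or
    Vn/8 and clamping to [Vn/2, 3*Vn/4].
    """
    max_sram, min_sram = 3 * Vn // 4, Vn // 2
    delta = curr_evict - prev_evict          # growth in SRAM evictions
    if delta >= 2 * bias:                    # large rise: try a big step up
        if sram_ways + Vn // 4 <= max_sram:
            sram_ways += Vn // 4
        elif sram_ways + Vn // 8 <= max_sram:
            sram_ways += Vn // 8             # fall back to the small step
    elif delta >= bias:                      # moderate rise: small step up
        if sram_ways + Vn // 8 <= max_sram:
            sram_ways += Vn // 8
    elif -delta >= 2 * bias:                 # large fall: try a big step down
        if sram_ways - Vn // 4 >= min_sram:
            sram_ways -= Vn // 4
        elif sram_ways - Vn // 8 >= min_sram:
            sram_ways -= Vn // 8
    elif -delta >= bias:                     # moderate fall: small step down
        if sram_ways - Vn // 8 >= min_sram:
            sram_ways -= Vn // 8
    return sram_ways, Vn - sram_ways         # STT gets whatever remains
```

For example, with Vn = 8 and bias = 100, a rise of 200 SRAM evictions starting from a 4/4 split yields a 6/2 split, matching timestamp t1 of the working example below; a further large rise leaves the split at 6/2 because both step sizes would exceed the maximum.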
Note that the above dynamic victim cache partitioning operations are performed at the end
of each interval until the end of application execution (line 42). They decide the new partition
sizes for the next interval. Meanwhile, between interval boundaries, each block evicted from
the main hybrid cache is placed in the appropriate region of the victim cache according to
the new sizes of the victim regions (line 40). When the size of a victim region changes, there may
be SRAM blocks in the STT region of the victim cache and vice versa. To stabilize
each region with the correct blocks, we propose an Interval-wise Region-based Victim Placement
(IRVP) algorithm. In the algorithm, the existing status count for each region in the victim cache
is represented by the variables exist_STT_ways and exist_SRAM_ways, respectively (lines 44 and
45). Let the block B be evicted from region P of the hybrid cache (the region other than P is denoted
P′) (lines 46 and 47). To place B in the victim cache, the two cases are described
below:
• The new size of P is less than or equal to the existing count: In this case, the block B
is placed in region P of the victim cache by victimizing the LRU block from P. Note that,
in this case, the r_bit is used to identify the region of each block during victim selection
(lines 48 and 49).
• The new size of P is greater than the existing count: Here, the block B is placed in
region P′ of the victim cache by evicting the LRU block (at a location, say, T) from P′.


Fig. 3. Working example of Region-based Dynamic Victim Cache Partitioning (RDVCP).

Once the block B is placed, the respective r_bit at location T is updated accordingly (lines
50 to 54).
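Under one plausible reading of the IRVP function (lines 43 to 56), in which the existing counts track how many victim-cache entries currently hold blocks of each origin, the placement decision can be sketched as follows (illustrative Python, not the authors' implementation):

```python
def irvp_place(block_region, exist_count, new_alloc):
    """Decide which victim-cache region receives an evicted block B.

    block_region: 'SRAM' or 'STT', the hybrid-cache region P that B left.
    exist_count: dict, victim-cache entries currently holding blocks of
    each origin (tracked via their r_bits).  new_alloc: dict, the way
    allocation chosen for this interval.  Returns the region whose LRU
    entry is replaced by B, updating exist_count toward new_alloc.
    """
    other = 'STT' if block_region == 'SRAM' else 'SRAM'
    if new_alloc[block_region] <= exist_count[block_region]:
        # P already holds at least its new quota: replace the LRU block
        # among P-origin entries (identified via the r_bit)
        return block_region
    # P is under its new allocation: take the LRU slot of the other
    # region P', install B there, and update that slot's r_bit
    exist_count[block_region] += 1
    exist_count[other] -= 1
    return other
```

With exist_count = {'SRAM': 4, 'STT': 4} and new_alloc = {'SRAM': 6, 'STT': 2}, an SRAM victim is steered into the STT region and its r_bit is cleared, matching case (a) of Figure 4.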

Working Example: Figure 3 presents a working example of RDVCP. In the example, an
eight-entry fully associative victim cache is considered. We set the value of Bias (used to change
the victim cache partition size) to 100. As per the proposed RDVCP algorithm, during
runtime, the maximum number of ways allocated to SRAM victims is 6 (3Vn/4), and the minimum
is 4 (Vn/2). Similarly, the minimum number of ways allocated to STT victims
is 2, and the maximum is 4.
Initially, when application execution starts at time t0, four ways of the victim cache
are allocated to the STT victims (represented by orange boxes) and four ways to the SRAM victims
(shown as grey boxes). To demonstrate the example, the victim cache partitioning status at
different timestamps is considered. To measure the eviction counts from the SRAM partition in the
current and previous intervals, two variables, Curr_SR_Evict and Prev_SR_Evict, are used
in the example.
At timestamp t1: The SRAM evictions in the current interval exceed those of the previous interval by 200
(2 ∗ Bias), so the SRAM victim partition in the victim cache is increased by
2 (Vn/4), and the STT victim partition is decreased by 2 (Vn/4).
At timestamp t2: The SRAM evictions in the previous interval exceed the current-interval SRAM
evictions by 100 (equal to Bias, less than 2 ∗ Bias), so the STT victim partition is increased by
1 (Vn/8), and the SRAM victim partition is decreased by 1. Similarly, at timestamp t3, the opposite
case is seen.
At timestamp t4: For the last case, the previous-interval SRAM evictions exceed the current-
interval ones, so the STT partition is increased by 2, and the SRAM partition is decreased by 2.
In the other special cases at different timestamps (for example, at t2 and t2′), where applying the
eviction counts would violate the victim cache partition constraint limits, no
operation is performed.


Fig. 4. Working example of Interval-wise Region-based Victim Placement: (a) new > existing, (b) new ≤
existing.

However, in the other cases (for example, at t3 and t3′), where the eviction count difference exceeds
200 (2 ∗ Bias) but the victim partition size lies between the maximum and minimum allowed
limits, an appropriate increase or decrease (Vn/8) of the partition size
is performed by RDVCP.
Figure 4 shows a working example of Interval-wise Region-based Victim Placement. In the
example, an eight-entry fully associative victim cache is considered, where each entry consists of
the victim data block (labeled SR for an SRAM-region block and ST for an STT-region block) and
its associated r_bit. Two cases are considered to demonstrate the method. In the first case (a), the
existing count of SRAM victim-region blocks in the victim cache is less than the new SRAM
victim-region count for the current interval. In this condition, when the block SR6 is evicted from
the SRAM region of the hybrid LLC, the evicted block is placed in the STT victim region of the
victim cache at LRU position four, and its r_bit is set to zero. However, in the second case (b), the
existing count of SRAM victim-region blocks in the victim cache is greater than the new SRAM
victim-region count for the interval. Here, when the block SR6 is evicted from the LLC, it is
placed in its own SRAM region by evicting that region's LRU block, say, at location 0, and the r_bit
remains the same.
Figure 5 shows a summarized flow diagram of the proposed approaches, AVBP and
RDVCP, during application execution in the CMP system. Figure 5(a) summarizes the combined
scheme including RDVCP, Figure 5(b) presents the working of AVBP, and Figure 5(c) shows the
interval-wise region-based victim placement (IRVP).

5 EXPERIMENTAL EVALUATION
5.1 Simulator Setup
We implemented our proposed approaches in the full-system simulator gem5 [8]. In the simu-
lator, we used the Ruby memory module with the MESI CMP-based cache controller. Table 4 shows
the system parameters used in our simulations. The experiments are conducted on different
configurations of hybrid shared LLCs (L2 cache) and with different victim cache sizes. The timing,
energy, and area parameters of these configurations are obtained with CACTI [23] and NVSim [11]


Fig. 5. (a) Working flowchart summarizing the proposed schemes: RDVCP and AVBP. (b) Flowchart repre-
senting AVBP. (c) Flowchart representing the IRVP.

Table 4. System Parameters

Components     Parameters
Processor      2 GHz, Quad Core, x86
L1 Cache       Private, 32 KB SRAM split I/D caches, 4-way set associative,
               64 B block, 1-cycle latency, LRU, writeback policy
L2 Cache       Shared, 16-way set associative (12-way STT-RAM and 4-way SRAM),
               64 B block, LRU, writeback policy
Victim Cache   SRAM, fully associative, 64 B block, LRU policy
Main Memory    2 GB, 160-cycle latency
Protocol       MESI CMP Directory

at the 32 nm technology node. Table 5 reports these values. Note that we have also considered
the energy consumption (both static and dynamic) of the victim cache in our experiments.
We perform a comparison analysis against different baselines, existing approaches, and
different variants of the proposed method, as listed below:

• Base pure STT and Base pure SRAM: The baseline architectures with no data placement
policy, using Least Recently Used (LRU) as the replacement policy.
• Base HCA: The baseline hybrid cache architecture with an integrated victim cache, having
no data placement policy and using LRU as the replacement technique.
• Read Write aware Hybrid Cache Architecture (RWHCA) [31] (denoted by R): An
existing data placement approach that places the data blocks according to the types of
accesses to the different regions of the hybrid cache. Here, 2-bit counters are used to


Table 5. Timing and Energy Parameters for Different LLC and Victim Cache (VC) Configurations

LLC/VC Size          LLC Type  Static Power (mW)  Read Energy (nJ)  Write Energy (nJ)  Read Latency (ns)  Write Latency (ns)
LLC Size = 1MB       SRAM      138.7              0.116             0.116              1.874              1.874
                     STT       59.44              0.188             2.117              2.117              11.34
                     Hybrid    79.25              0.188/0.116       2.117/0.116        2.117/1.874        11.34/1.874
LLC Size = 2MB       SRAM      282.2              0.221             0.221              2.00               2.00
                     STT       92.68              0.285             2.147              2.336              11.54
                     Hybrid    140.1              0.285/0.221       2.147/0.221        2.336/2.00         11.54/2.00
LLC Size = 4MB       SRAM      554.3              0.330             0.330              2.180              2.180
                     STT       229.6              0.346             2.270              2.577              11.94
                     Hybrid    310.8              0.346/0.330       2.270/0.330        2.577/2.180        11.94/2.180
LLC Size = 8MB       SRAM      1,094.5            0.432             0.432              2.043              2.043
                     STT       342.0              0.693             2.355              3.177              12.41
                     Hybrid    530.1              0.693/0.432       2.355/0.432        3.177/2.043        12.41/2.043
Victim Cache
VC Size = 16 Entries SRAM      2.93               0.007             0.007              0.240              0.240
VC Size = 32 Entries SRAM      5.48               0.009             0.009              0.307              0.307
VC Size = 64 Entries SRAM      10.44              0.014             0.014              0.416              0.416

capture the accesses of each block. When there is an imbalance in the accesses in either of
the regions, it triggers the migration of blocks.
• Write Intensity (WI) [4] (denoted by W): A predictor-based data placement approach that
places data blocks in the different regions of the hybrid cache based on their write intensity.
Here, the predictor is composed of 1,024 entries, each consisting of a 10-bit hashed address
field (used for indexing the predictor), a valid bit, and a 3-bit state field. To capture write
accesses, a 10-bit trigger-instruction field and a 2-bit counter are incorporated with each
entry of the hybrid LLC. The write-intensity threshold to place a block in the SRAM region is set to 4.
• Adaptive Placement and Migration (APM) [30] (denoted by A): This policy categorizes
writes into three classes: demand-write, core-write, and prefetch-write. Depending
on the class of write and the results of the dead-block and write-burst prediction tables, the
decision is taken either to bypass the block or to place different blocks in different regions
of the hybrid cache. To implement APM, we have considered 4,096-entry dead-block
and write-burst prediction tables that comprise 2-bit saturating counters. Besides,
a 16-way pattern simulator (consisting of 16 tag fields (16-bit), 16 read PC fields (16-bit), 4
write PC fields (16-bit), 16 read LRU fields, 4 write LRU fields, and 16 valid bits) is associated
with each of the 32 cache sets in a cache bank. Also, with each entry of the hybrid LLC, a
prediction bit is incorporated to represent the dead/write-burst predicted status of the cache
block. In the simulations, the write-burst and dead-block prediction threshold is set to 2.
• HCA with Victim Cache and Placement policy (HCAVP) (denoted by O): The baseline
hybrid cache that places data in different regions upon an LLC miss in the same manner as
RWHCA. The only differences from RWHCA are that there are no counters associated
with the blocks and there is no migration process.
• HCAVP with AVBP (denoted by P): Hybrid cache architecture with full support of the
placement policy, including the placement of a block from the victim cache, upon a hit, into
the appropriate region of the hybrid cache.


Table 6. Benchmarks Used for Evaluation

Benchmark suite  Benchmarks
PARSEC v2.1      Canneal (Cann), Dedup (Ded), Fluidanimate (Fluid),
                 Freqmine (Freq), Streamcluster (Stream), X264
SPEC CPU2006     SMix1: milc, hmmer, bzip2, soplex (Mid WBKI)
                 SMix2: dealII, sjeng, h264ref, tonto (Low WBKI)
                 SMix3: gobmk, tonto, sjeng, namd (High WBKI)
                 SMix4: calculix, astar, dealII, h264ref (Random WBKI)
MiBench          MMix1: typeset, mad, jpeg, susan (Low WBKI)
                 MMix2: blowfish, IFFT, bitcount, patricia (Mid WBKI)
                 MMix3: CRC32, qsort, GSM enc., ADPCM enc. (High WBKI)
                 MMix4: jpeg, basicmath, IFFT, blowfish (Random WBKI)

• HCAVP with RDVCP (denoted by Q): Hybrid cache architecture that includes the initial
block placement from main memory upon an LLC miss and the proposed region-based
dynamic victim cache partitioning approach.
• HCAVP with AVBP and RDVCP (denoted by S): Hybrid cache architecture integrated
with all the proposed approaches.
In our simulations, we have accounted for the extra time taken for accesses and searches in
the victim cache. In particular, five cycles are charged for block swapping between the main
hybrid cache and the victim cache, and one extra cycle for searching a block in the victim
cache; only one extra cycle is needed, as the search begins in parallel with the main cache
lookup. Writing and placing the evicted LRU blocks from the hybrid cache into the victim
cache do not fall on the critical path, as the victim cache is an independent structure and is not
affected when a new request arrives at the main cache. We use 42-bit swap buffers in our
experiments for the tag swap operations. Besides, to measure the evictions from the SRAM region,
we added two 12-bit counters, and to maintain the existing and new counts of the victim
regions, four 5-bit counters are used. The energy consumption of the victim cache (both static
and dynamic, included in our results) is calculated using the NVSim [11] tool.
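As a quick sanity check, the fixed storage cost of this bookkeeping is tiny. The sketch below totals only the structures listed above, plus one r_bit per entry of a 32-entry victim cache (the r_bit total is our assumption; the paper does not sum it):

```python
# Back-of-envelope storage overhead (in bits) of the proposed bookkeeping
swap_buffer = 42        # tag swap buffer for block exchange
evict_ctrs  = 2 * 12    # Curr_SRAM_Evict and Prev_SRAM_Evict counters
way_ctrs    = 4 * 5     # existing/new way counts for the two victim regions
r_bits      = 32 * 1    # one r_bit per entry of a 32-entry victim cache
total_bits  = swap_buffer + evict_ctrs + way_ctrs + r_bits
print(total_bits)       # 118 bits, i.e., under 15 bytes
```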

5.2 Benchmarks
The proposed techniques are evaluated using multi-threaded (PARSEC [7]), multi-programmed
(SPEC CPU 2006 [13]), and embedded (MiBench [12]) benchmark suites. Six benchmarks
with the medium input set are taken from PARSEC, 12 benchmarks with the ref
input set are used from SPEC, and 13 benchmarks with large input sets are used from
MiBench. Table 6 lists the names of the PARSEC benchmarks along with the SPEC (SMix) and
MiBench (MMix) application mixes. Note that these mixes are composed by considering the Write-Backs
per Kilo Instruction (WBKI) of each individual benchmark. We run each SPEC multi-programmed
workload (SMix) for 1B instructions (such that each benchmark covers at least 250M instructions)
after warming up with at least 250M instructions. Due to the limited working-set
size of the MiBench benchmarks, each mix (MMix) is simulated for 500M instructions after warming
up with at least 250M instructions. The multi-threaded workloads run for their whole
Region Of Interest (ROI).
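WBKI here follows the usual per-kilo-instruction convention (our restatement; the paper does not spell out the formula):

```python
def wbki(writebacks, instructions):
    """Write-Backs per Kilo Instruction for a benchmark's simulated run."""
    return writebacks * 1000 / instructions

# e.g., a hypothetical benchmark with 1.2M writebacks over 250M instructions
print(wbki(1_200_000, 250_000_000))  # 4.8
```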

6 RESULTS AND ANALYSIS


Out of the different configurations of LLCs and victim cache sizes, we select a 4 MB 16-way set-
associative last-level hybrid cache and a 32-entry (Vn) fully associative victim cache. We choose the value

Fig. 6. Normalized LLC Writes of O and P against RWHCA (R), WI (W), and APM (A) (lesser in STT is better).

Table 7. Percentage Savings in Write Accesses (More Is Better)

Policy     Region  O        P        Row
RWHCA      STT     34.3%    38.3%    1
           SRAM    −40.4%   −42.7%   2
           Total   6.34%    6.80%    3
WI         STT     −20.7%   −13.5%   4
           SRAM    24.1%    22.8%    5
           Total   7.5%     7.9%     6
APM        STT     15.9%    20.9%    7
           SRAM    45.9%    45.1%    8
           Total   35.5%    35.8%    9
Base HCA   STT     49.7%    52.7%    10
           SRAM    −143.1%  −147.1%  11
           Total   0.58%    1.05%    12
Base STT   STT     60.4%    62.8%    13
           Total   −1.52%   −1.05%   14
Base SRAM  SRAM    47.8%    47%      15
           Total   −0.3%    0.2%     16

of Bias as 500 and I as 1M cycles for RDVCP. The reasons behind choosing these values and cache
configurations are explained in a later section, where we present brief results with different
cache configurations and parameter values of RDVCP. We show results on the following
metrics: write savings, CPI improvement, execution time, miss rate, and energy overhead.

6.1 Write Accesses


Figure 6 presents the normalized write accesses against RWHCA, WI, and APM. Table 7 presents
the percentage savings in write accesses by techniques O and P over RWHCA, WI, APM, and
the Base STT, SRAM, and HCA. Note that the negative values in the table (rows 2, 4, 11, 14, and
16) indicate an increase in writes.

• RWHCA: With respect to RWHCA, the savings in write accesses by policy P (38.3%) are
basically due to fewer writebacks under AVBP (row 1). However, a large number of writebacks
are redirected to the SRAM region (−42.7%) (row 2). Note that the improvement in the total

write accesses (row 3) is mainly due to the improvement in the STT region, as the hybrid cache
has a 3:1 ratio of STT to SRAM ways.
• WI: With WI, the degradation in the number of write accesses under policy P (−13.5%) (row 4)
is due to the predictor placing a considerable number of load-miss blocks in the SRAM
region, whereas, as in RWHCA, all store-miss blocks are placed in the
SRAM region. As can be seen from row 5, we obtain savings in write accesses
(22.8%) in the SRAM region. These write savings result in the overall
savings of policy O (7.5%) and P (7.9%) over WI (row 6).
• APM: The policy APM reduces the writebacks in the STT region significantly, but the writes
incurred due to (1) the large number of block exchanges between STT and SRAM for write-
burst and live predicted blocks and (2) the allocation of blocks loaded on demand
misses/writes in the STT region outweigh the writeback savings obtained by APM. These
factors lead to the write-access improvements by policies O and P for STT (row 7),
SRAM (row 8), and overall (row 9). In addition, with APM, we
have observed that benchmarks with a small working-set size (like MiBench) benefit
in terms of write accesses.
• Base HCA: Compared to Base HCA, the savings in write accesses for the STT region (52.7%)
(row 10) and the increase in writes for the SRAM region (−147.1%) (row 11) are due to the
appropriate placement of the different types of blocks in the different regions of the hybrid
cache.
• Base STT/SRAM: Over the baseline STT and SRAM, significant region-wise gains are
observed (rows 13 and 15) with the proposed approaches O and P. At the same time, due
to proper placement by AVBP, the proposed technique maintains nearly the same total
number of writes, with only a marginal increase (rows 14 and 16).

Note that the write-access results are not discussed for policies Q and S, as they maintain
results similar to O and P. Policies Q and S optimize the victim cache and do
not affect STT writes.
Along with these results, to show the effectiveness of the AVBP placement approach, we
calculate the percentage of read and write counts for the blocks that are placed in the different
regions of the hybrid cache when found in the victim cache. Table 8 presents this analysis. In
the analysis, we categorize the blocks placed in the hybrid cache from the victim cache into three
categories: dirty load-access blocks (placed in SRAM), store-access blocks (placed in SRAM), and
clean load-access blocks (placed in STT). By experimental analysis, we found that on average
46% of the blocks from the victim cache (of which 11% are dirty load-access blocks) are placed
in the SRAM region, and the remaining 54% (clean load-access blocks) are placed in the STT region.
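The three-way categorization above reduces to a simple rule, sketched below in Python (a restatement of the categories, with hypothetical flag names; the paper's AVBP mechanism itself is defined in an earlier section):

```python
def avbp_target_region(is_dirty, last_access_is_store):
    """Region of the hybrid LLC that receives a block on a victim-cache hit.

    Dirty-load and store-access blocks (write-prone) go to SRAM to spare
    STT-RAM writes; clean load blocks go to the larger STT region.
    """
    if is_dirty or last_access_is_store:
        return 'SRAM'
    return 'STT'
```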
Effect on endurance: With fewer writebacks, our proposed technique improves the
endurance of the non-volatile region of the cache (measured in terms of the number of writes to
the STT region, i.e., write traffic reduction). In particular, the savings in writeback traffic for P
in the STT region are 40.5% over RWHCA. However, no significant gain in writeback traffic is
observed over WI and APM, due to their block placement approaches.

6.2 Performance Improvement


Figure 7 shows the percentage improvement in CPI with respect to Base STT. Table 9 reports the
percentage improvement in CPI against the existing techniques RWHCA, WI, and
APM and the baselines Base STT, Base SRAM, and Base HCA. With RWHCA, the CPI improvement
of 3.80% (row 1) is due to the applied victim cache with the proposed policies and the smaller number of

Table 8. Read and Write Count Percentage for Different Categories of Blocks Placed in Hybrid Cache from Victim Cache (R: Read, W: Write)

Workload   Dirty Load (R / W)  Clean Load (R / W)  Store (R / W)
Cann       1.2% / 11.2%        97.5% / 67.7%       1.3% / 21%
Ded        3.7% / 27.2%        49% / 36.2%         47.3% / 36.5%
Fluid      0.5% / 0.2%         99% / 99.3%         0.5% / 0.4%
Freq       0.1% / 0.15%        58.2% / 30.5%       41.6% / 69.3%
Stream     3.2% / 14.8%        82.3% / 3.8%        14.4% / 81.3%
X264       0.1% / 0.8%         48.8% / 0.7%        51% / 98.5%
SMix1      0.2% / 2.2%         25.3% / 48.9%       74.4% / 49%
SMix2      4.4% / 8.5%         76.8% / 64%         18.8% / 27.5%
SMix3      0.7% / 2%           40.8% / 25.7%       58.4% / 72.2%
SMix4      0.8% / 5.6%         62% / 59.3%         37.1% / 35.1%
MMix1      2.1% / 3.7%         48.8% / 6.3%        49% / 90%
MMix2      2.4% / 13%          42.1% / 0.4%        55.5% / 86.6%
MMix3      0.9% / 26.7%        41.5% / 0.53%       57.6% / 72.7%
MMix4      1.3% / 22.7%        87% / 6.8%          11.7% / 70.4%
Mean       1.5% / 9.92%        61.4% / 32.1%       37% / 58%

Fig. 7. Percentage Improvement in CPI of R, W, A, O, P, Q, and S against Base STT (more is better).

Table 9. Percentage Improvements in CPI (More is Better)

Policy O P Q S Row
RWHCA 1.50% 2.46% 2.40% 3.80% 1
WI 1.1% 2.1% 2% 3.5% 2
APM 0.9% 1.9% 1.8% 3.33% 3
Base HCA 2.28% 3.23% 3.17% 4.63% 4
Base STT 3.77% 4.70% 4.63% 6.1% 5
Base SRAM −3.53% −2.46% −2.54% −0.9% 6

write operations (refer to rows 1 and 3 of Table 7). With WI, an overall CPI improvement of 3.50%
(row 2) is observed with the proposed approach. This improvement mainly arises because the
block placement approach of WI allocates a large number of blocks to the limited-size SRAM
region of the hybrid cache (including all store-missed blocks and a considerable number of
load-missed blocks), which in turn increases the miss rate of WI (refer to Figure 10 and row 2 of
Table 12). In particular, with WI, even though write-access savings are observed in the STT region,
the larger block allocation in the SRAM region increases the miss rate and thus hurts performance.
With respect to APM, an overall improvement of 3.33% in CPI is obtained with the proposed
approach (row 3). The gain in CPI stems from the large number of overall writes due to block
exchanges between STT and SRAM (refer to row 9 of Table 7), which in turn increases the miss
rate of APM (refer to Figure 10 and row 3 of Table 12) and affects performance. Note that against
all the existing approaches, the CPI gain and the miss rate improvement (refer to Section 6.5) are
due to the applied victim cache and its proposed optimization policies for the hybrid cache.
Compared to Base HCA and Base STT (rows 4 and 5), the improvements (4.63% and 6.1%) are due
to the applied placement policy, the victim cache with the proposed optimization techniques, and
large savings in write accesses with respect to Base STT. The performance gap between Base STT
and Base SRAM is 6.86%. By using a proper placement approach in the hybrid cache and by
employing the victim cache with its optimizations, the performance degradation is brought down
to 0.9% (row 6).

6.3 Execution Time


The percentage improvement in execution time against Base STT is shown in Figure 8. Table 10
presents these improvement values against the existing techniques RWHCA, WI, and APM and
the baselines HCA, STT, and SRAM. The improvements in execution time (3.03% to 5.75%)
(rows 1 to 5) are due to the improvements in CPI and the reduction in miss rate, which in

Fig. 8. Percentage improvement in execution time of R, W, A, O, P, Q, and S against Base STT (more is better).

Table 10. Percentage Improvement in Execution Time (More Is Better)

Policy O P Q S Row
RWHCA 1.81% 2.62% 2.43% 3.62% 1
WI 1.4% 2.23% 2% 3.22% 2
APM 1.18% 2.02% 1.81% 3.03% 3
Base HCA 1.85% 2.70% 2.50% 3.70% 4
Base STT 4.01% 4.80% 4.61% 5.75% 5
Base SRAM −3.28% −2.36% −2.59% −1.25% 6

Fig. 9. Normalized LLC energy consumption of O and P against RWHCA (R), WI (W), and APM (A) (less is
better).

turn decrease the main memory accesses. The execution time gap between Base STT and Base
SRAM is 7%. By employing our proposed optimized victim cache with the hybrid cache, the
execution time gap is brought down to 1.25%.
Thus, using pure STT as the LLC degrades performance and execution time due to the costly
write operations. This overhead can be mitigated by RWHCA, WI, and APM. However, with a
proper placement policy in the hybrid cache and the use of a victim cache with the proposed
policies, performance can be improved further.

6.4 Energy Consumption


Figure 9 shows the energy consumption of the proposed techniques normalized with respect to
Base STT. Table 11 reports the improvement values. Note that negative values (rows 2, 4, 6, 7,
and 11) in the table indicate an increase in energy. The respective improvements and degradations
are due to the following reasons:


Table 11. Percentage Improvement in Energy Consumption (More Is Better)

Policy          Region  O        P        Row
RWHCA           STT     24.32%   26.33%   1
                SRAM    −68.5%   −70.2%   2
                Total   13.94%   15.3%    3
WI              STT     −23.9%   −20.6%   4
                SRAM    35.2%    34.5%    5
                Total   −5.87%   −4.25%   6
APM             STT     −19.3%   −16.1%   7
                SRAM    49.1%    48.5%    8
                Total   8.83%    10.23%   9
Base HCA        STT     29.4%    31.3%    10
                SRAM    −17.8%   −19%     11
                Total   21.4%    22.6%    12
Base STT/SRAM   STT     30.7%    31.7%    13
                SRAM    41.1%    41.6%    14
                Total   40.2%    40.8%    15

• RWHCA: Over RWHCA, the improvement (rows 1 and 3) of 15.3% in energy consumption
is mainly due to fewer write operations (thanks to AVBP, as evident from Table 7) and the
absence of migrations (which incur extra energy) in the proposed techniques. The increase
in energy consumption in the SRAM region is due to the large number of write accesses
(as shown in Table 7).
• WI: As can be seen from row 4, our proposed policies (O and P) increase the energy
consumption (−23.9% and −20.6%) relative to WI, because WI saves write accesses to
the STT region (as evident from row 4 of Table 7) on account of its prediction-based block
placement. However, our policies save write accesses in the SRAM region (refer to row 5
of Table 7). Hence, in the SRAM region, an energy improvement of 34.5% (row 5) is observed.
Overall, the energy penalty over WI is between 4% and 6% (row 6).
• APM: Over APM, the proposed policies (O and P) consume more energy in the STT region (−19.3% and −16.1%, row 7), because APM performs fewer reads and writes in the STT region. The fewer STT accesses are due to the considerable number of block migrations from the STT to the SRAM region. However, in the proposed policies, the fewer allocations in the SRAM region and the judicious block placement yield large energy savings in the SRAM region (row 8), and thus the overall saving (row 9) is observed.
• Base HCA: Compared to Base HCA, the energy improvements (31.3% in row 10 and 22.6% in row 12) are due to the appropriate placement of blocks in the different regions. The same reason applies to the energy increase in the SRAM region (row 11).
• Base STT/SRAM: Against Base STT-RAM/SRAM, the energy improvement value given in row 13 (31.7%) is the dynamic energy improvement over the baseline STT, whereas the values given in row 14 (41.6%) and row 15 (40.8%) are against the SRAM baseline for static energy and total energy consumption, respectively.
Note that we have also included the energy consumption of the victim cache (both static and dynamic) in our calculations. The energy consumption results are not discussed for policies Q and S, as they incur the same number of write accesses as O and P in the main hybrid cache.

5:22 S. Agarwal and H. K. Kapoor

Fig. 10. Percentage increase in miss rate for R, W, A, O, P, Q, and S against Base STT (less is better).

Table 12. Percentage Improvement in Miss Rate (More Is Better)

Policy O P Q S Row
RWHCA 6.76% 5.94% 7.76% 7.90% 1
WI 21.2% 20.37% 21.7% 22.32% 2
APM 25.3% 24.4% 25.8% 26.4% 3
Base HCA −0.14% −0.95% 0.38% 1.00% 4
Base STT 1.17% 0.35% 1.70% 2.30% 5
Base SRAM 0.97% 0.15% 1.48% 2.10% 6

6.5 Miss Rate Improvement


The miss rate improvement by the proposed techniques against Base STT is given in Figure 10. Table 12 reports the improvement percentages in miss rate achieved by the proposed variants against Base STT, SRAM, HCA, RWHCA, WI, and APM. We observe an improvement of up to 7.90% in the miss rate over the existing technique RWHCA (row 1) due to the added victim cache and its proposed optimized policies. Over WI, miss rate improvements in the range of 20.37%–22.32% (row 2) are obtained, because WI allocates a large number of blocks (including all store-missed blocks and a considerable fraction of load-missed blocks) in the limited-sized SRAM region, leading to an increase in misses. With APM, large miss rate improvements in the range of 24.4%–26.4% (row 3) are obtained. This is because APM uses a prediction-based bypass method and causes a large number of block migrations from STT to SRAM, which in turn increases the premature invalidation of SRAM blocks. The proposed technique, however, maintains nearly the same miss rate as all the baselines (rows 4 to 6), which shows the effect of associating the victim cache with the main hybrid cache. In other words, we can use policies that reduce writes in STT (e.g., RWHCA, WI, and APM) and rely on the victim cache to make up for the resulting degradation. Besides, our intelligent block movement policies also improve the write endurance of the hybrid cache architecture.

6.6 Storage and Area Overhead


In the proposed technique, along with the victim cache, we incorporate a single r_bit with each victim cache entry and two 42-bit swap buffers that facilitate the tag-swapping process between the main hybrid cache and the victim cache. In addition, we use two 12-bit counters and four 5-bit counters to maintain the evictions and the victim-region counts for RDVCP, and an additional dirty bit with each main-cache entry for AVBP. All of this amounts to a storage overhead of 0.23% with respect to Base pure STT (which has no victim cache) for a 32-entry victim cache. Table 13 summarizes the storage and area overhead analysis of the existing architectures, the baseline, and the proposed architecture. Note that negative values


Table 13. Storage and Area Overhead Comparison Analysis

Policy Storage Overhead Area Overhead


Base HCA 0.23% 0.71%
RWHCA −0.13% −0.331%
WI −1.93% −6.54%
APM −0.23% −4.92%

Table 14. Comparative Analysis for Different LLC Capacity

LLC Size   CPI Imp. (%)   Write Access Imp. (%)   Exe. Time Imp. (%)   Miss Rate Imp. (%)   Energy Imp. (%)
1MB        11.5%          −4.72%                  11.21%               3.20%                37.3%
2MB        8.5%           −1.40%                  8.30%                2.86%                37.1%
4MB        6.06%          −1.13%                  5.75%                2.30%                31.8%
8MB        4.15%          −0.53%                  3.98%                1.43%                29.1%

Table 15. Comparative Analysis for Different Victim Cache Sizes (Vn )

Victim Size (Vn)   CPI Imp. (%)   Write Access Imp. (%)   Exe. Time Imp. (%)   Miss Rate Imp. (%)   Energy Imp. (%)
16 Entries         5.1%           −1.41%                  4.91%                0.80%                32.6%
32 Entries         6.06%          −1.13%                  5.75%                2.30%                31.8%
64 Entries         6.60%          −1.38%                  6.50%                4%                   31%

in the table indicate the larger area occupancy or storage needed to implement the existing architectures.
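As a rough cross-check of the reported overhead, the extra state can be tallied directly. The sketch below assumes a 4 MB LLC with 64 B lines (one of the configurations in Table 14); these geometry figures are assumptions for illustration. Counting only the data array, the estimate lands just under the reported 0.23%, with the remainder attributable to tag storage and the victim cache's own data array, which are not modeled here.

```python
# Back-of-the-envelope estimate of the extra state added by the proposed
# scheme, relative to an assumed 4 MB last-level cache with 64 B lines.

LLC_BYTES = 4 * 1024 * 1024       # assumed LLC capacity
LINE_BYTES = 64                   # assumed line size
main_entries = LLC_BYTES // LINE_BYTES

extra_bits = (
    main_entries * 1              # AVBP: one extra dirty bit per main-cache entry
    + 32 * 1                      # one r_bit per entry of the 32-entry victim cache
    + 2 * 42                      # two 42-bit swap buffers for tag swapping
    + 2 * 12                      # two 12-bit eviction counters (RDVCP)
    + 4 * 5                       # four 5-bit victim-region counters (RDVCP)
)

overhead_pct = 100.0 * extra_bits / (LLC_BYTES * 8)
print(f"extra state: {extra_bits} bits (~{overhead_pct:.3f}% of the data array)")
```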

7 COMPARATIVE ANALYSIS FOR PARAMETERS


In addition to the results presented in the previous section, we also experimented with different configurations of the main hybrid LLC and the victim cache, and with different values for the parameters of the proposed RDVCP technique. In this section, we show how the various metrics change relative to the chosen parameter values. All values are given with respect to Base STT.

7.1 Change in LLC Size


Table 14 reports the metric values for different LLC capacities with the proposed scheme. A change in capacity affects the residency of blocks in the LLC: a smaller cache results in shorter block residency than a larger one. As can be seen from the table, a victim cache associated with a small cache is more beneficial (evident from the CPI improvement) than one associated with a large cache. The reason is that, with a small cache, the victim cache holds blocks that were evicted prematurely before finishing their lifetime. Thus, embedded systems with a small-capacity main cache can especially benefit from this proposal.

7.2 Change in Victim Cache Entries (Vn )


Table 15 reports the metric values for different numbers of victim cache entries (Vn). The size of the victim cache determines its capability to retain victim blocks from the main hybrid cache.


Table 16. Comparative Analysis for Different Bias

Bias   CPI Imp. (%)   Write Access Imp. (%)   Exe. Time Imp. (%)   Miss Rate Imp. (%)   Energy Imp. (%)
100    5.2%           −1.28%                  5.19%                1.57%                31.8%
250    5.52%          −1.68%                  5.50%                1.72%                31.4%
500    6.06%          −1.13%                  5.75%                2.30%                31.8%
1000   5.61%          −2.05%                  5.47%                1.20%                30.25%

Table 17. Comparative Analysis for Different Intervals

Interval (I)   CPI Imp. (%)   Write Access Imp. (%)   Exe. Time Imp. (%)   Miss Rate Imp. (%)   Energy Imp. (%)
I = 0.5M       6.48%          −2.8%                   6.22%                2.71%                30.2%
I = 1M         6.06%          −1.13%                  5.75%                2.30%                31.8%
I = 2M         5.31%          −2.07%                  5.02%                1.18%                30.8%

With a larger victim cache, victim blocks reside longer, so the miss rate is reduced more than with a victim cache having few entries. However, a large victim cache requires a large number of tag comparisons, which incurs extra latency and execution-time overhead. Thus, a careful selection of the victim cache size keeps the hardware efficient.

7.3 Change of Bias


The metric values for different bias values are reported in Table 16. The bias governs how readily the victim-region partition sizes change. A smaller bias value results in frequent changes in partition size, which creates instability in the different regions of the victim cache (along with smaller CPI and execution time improvements). A large bias value, however, causes less frequent reconfiguration of the victim cache partition; the resulting large number of evictions from the SRAM partition of the main hybrid cache cannot then be retained in the victim cache, leading to an increase in the miss rate.

7.4 Change of Interval (I )


Table 17 shows the comparative analysis for distinct interval values. The interval determines how frequently the partition sizes of the victim cache change for the different regions of the main hybrid cache. With a large interval, the partition changes less frequently, which increases evictions from the SRAM partition of the main hybrid cache and reduces the miss rate improvement. A smaller interval, however, results in frequent changes in partition size, which creates instability in the different regions of the victim cache, thereby increasing the number of write accesses and the energy consumed in the hybrid cache.
Thus, a careful selection of the Interval (I) and Bias values results in more efficient use of the victim cache.
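The interplay of Bias and Interval described above can be sketched as a simple repartitioning controller. This is an illustrative reconstruction, not the paper's exact hardware: the class name, the even initial split, and the one-entry-per-epoch adjustment rule are assumptions; the section states only that per-region eviction counts drive partition changes, moderated by the Bias threshold every Interval accesses.

```python
# Sketch of an RDVCP-style controller: every `interval` accesses, compare the
# per-region eviction counters and, if the imbalance exceeds `bias`, shift one
# victim-cache entry toward the region that is evicting more blocks.

class RDVCPController:
    def __init__(self, total_entries=32, interval=1_000_000, bias=500):
        self.sram_share = total_entries // 2   # victim entries serving SRAM victims
        self.stt_share = total_entries - self.sram_share
        self.interval = interval               # accesses between repartitions
        self.bias = bias                       # eviction-imbalance threshold
        self.sram_evictions = 0                # a 12-bit counter in hardware
        self.stt_evictions = 0
        self.accesses = 0

    def record_eviction(self, region):
        if region == "SRAM":
            self.sram_evictions += 1
        else:
            self.stt_evictions += 1

    def on_access(self):
        self.accesses += 1
        if self.accesses % self.interval == 0:
            self._repartition()

    def _repartition(self):
        # Grow the partition serving the region that evicted more blocks, but
        # only when the imbalance exceeds the bias, to avoid thrashing.
        diff = self.sram_evictions - self.stt_evictions
        if diff > self.bias and self.stt_share > 1:
            self.sram_share += 1
            self.stt_share -= 1
        elif -diff > self.bias and self.sram_share > 1:
            self.sram_share -= 1
            self.stt_share += 1
        self.sram_evictions = self.stt_evictions = 0   # start a fresh epoch
```

With the chosen parameters (I = 1M, Bias = 500), the defaults above reflect the configuration used in Tables 16 and 17; a smaller bias or interval makes the partition react faster at the cost of the instability discussed in the text.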

8 IMPACT OF APPLICATION CHARACTERISTICS ON THE PROPOSED APPROACH


Different applications have different read and write characteristics. This section elaborates on the effect of read- and write-intensive applications on the proposed approach. To quantify the impact of application characteristics, one read-intensive and one write-intensive application are selected from


Table 18. Impact of Application Characteristics on AVBP

Workload          Application   STT Plcmnt   SRAM Plcmnt   STT Read   SRAM Read   STT Write   SRAM Write   Row
Read Intensive    x264          93.2%        6.8%          48.9%      51.1%       0.65%       99.3%        1
                  SMix2         57.8%        42.2%         76.8%      23.2%       64%         36%          2
                  MMix2         23.2%        76.8%         42.1%      57.9%       0.4%        99.4%        3
Write Intensive   Dedup         37.5%        62.5%         49%        51%         36.2%       63.8%        4
                  SMix3         31.6%        68.4%         40.9%      59.1%       25.7%       74.3%        5
                  MMix3         35.5%        64.5%         41.5%      58.5%       0.53%       99.4%        6

each benchmark suite. In particular, we select Dedup (write-intensive) and x264 (read-intensive) from PARSEC, SMix3 (write-intensive) and SMix2 (read-intensive) from SPEC, and MMix3 (write-intensive) and MMix2 (read-intensive) from MiBench. Table 18 presents the effect of AVBP on the read- and write-intensive applications. As can be seen from the table, our AVBP policy makes judicious allocations in the different regions of the hybrid cache. Without AVBP, as is evident from Table 2, most of the writes are absorbed by the STT region, since blocks from the victim cache are usually placed in the STT region of the hybrid cache. For instance, without AVBP, 99.8% of the blocks of the Dedup workload are placed in the STT region; with AVBP, only 37.5% of the blocks are placed there (as Dedup is a write-intensive workload). In particular, with AVBP, the SRAM region absorbs writes in the range of 63.8%–99.4% for the write-intensive workloads and 36%–99.4% for the read-intensive workloads. Hence, AVBP ensures that the principle on which the hybrid cache is built is preserved. Note that the RDVCP policy targets victim cache optimization and has little impact on application characteristics.
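The placement decision AVBP makes on a victim-cache hit can be illustrated with a small sketch. The exact rule below is an assumption consistent with the hybrid-cache principle (write-prone blocks belong in SRAM, read-prone blocks in the write-costly STT-RAM); the paper states only that the decision uses the type of request and the victim block's dirty-bit status.

```python
# Sketch of an AVBP-style placement decision on a victim-cache hit: route
# blocks showing write behaviour to SRAM, clean read-fetched blocks to STT.

def avbp_place(request_is_write: bool, victim_is_dirty: bool) -> str:
    """Return the hybrid-cache region for a block that hit in the victim cache."""
    if request_is_write or victim_is_dirty:
        # The block is being written now, or was written before eviction:
        # keep it out of the write-costly STT region.
        return "SRAM"
    # A clean block fetched by a read is cheap to serve from the STT region.
    return "STT"
```

Under this rule, a dirty Dedup victim re-fetched by a load still lands in SRAM, matching the table's observation that write-intensive workloads direct most writes to the SRAM region.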

9 CONCLUSION
Hybrid Cache Architecture (built from SRAM and STT-RAM) is considered a viable option to replace conventional SRAM-based caches. In such an architecture, the placement of blocks in the different regions is a key challenge for energy efficiency and write endurance. Despite the energy savings, the performance benefit obtained from such an architecture falls short of expectations on account of the increased miss rate caused by the large number of write-intensive blocks placed in the limited-sized SRAM partition. To mitigate this, we associate a victim cache with the hybrid cache to retain the victim blocks evicted from the main cache. On each miss in the main hybrid cache, the victim cache is searched. Upon a hit in the victim cache, our proposed policy places the block in the appropriate region of the hybrid cache based on the type of request and the victim block's dirty-bit status. We also proposed a dynamic region-based victim cache partitioning technique to manage the runtime load and the uneven evictions from the SRAM region of the main hybrid cache. The partitioning technique lets the victim cache hold victims dedicated to each region, thereby increasing the possibility of caching the most recently used blocks evicted from both the SRAM and STT partitions of the main hybrid cache. The main aim of the proposed partitioning technique is to improve the efficacy of the victim cache in storing the appropriate number of blocks from each region.
To measure the efficacy of the proposed technique, we compared it with three existing methods and with different baselines. Experimental evaluation on a full-system simulator shows performance gains ranging from 4.2% to 11.5% for different last-level cache sizes. Our proposed technique also improves the write endurance by reducing the writeback


traffic in the STT region by 40.5% over RWHCA. Thus, the effective use of a victim cache alongside the main hybrid cache improves performance and makes the hardware more efficient overall.


Received August 2019; revised April 2020; accepted July 2020
