A Survey of Software Techniques For Using No-Volatile Memories For Storage and Main Memory Systems
A Survey of Software Techniques For Using No-Volatile Memories For Storage and Main Memory Systems
Abstract—Non-volatile memory (NVM) devices, such as Flash, phase change RAM, spin transfer torque RAM, and resistive
RAM, offer several advantages and challenges when compared to conventional memory technologies, such as DRAM and
magnetic hard disk drives (HDDs). In this paper, we present a survey of software techniques that have been proposed to exploit
the advantages and mitigate the disadvantages of NVMs when used for designing memory systems, and, in particular, secondary
storage (e.g., solid state drive) and main memory. We classify these software techniques along several dimensions to highlight
their similarities and differences. Given that NVMs are growing in popularity, we believe that this survey will motivate further
research in the field of software technology for NVMs.
Index Terms—Review, classification, non-volatile memory (NVM) (NVRAM), flash memory, phase change RAM (PCM)
(PCRAM), spin transfer torque RAM (STT-RAM) (STT-MRAM), resistive RAM (ReRAM) (RRAM), storage class memory (SCM),
Solid State Drive (SSD) .
Copyright (c) 2015 IEEE. This is author’s version. The final version is available at http://dx.doi.org/10.1109/TPDS.2015.2442980
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 2
solid state drive) and main memory. We classify these parameters. For more details on the device-level prop-
software techniques along several dimensions to high- erties of memory technologies, we refer the reader to
light their similarities and differences. We also discuss previous works [6, 25, 36, 65, 86, 100, 106].
those papers which compare or combine multiple In Table 1, access granularity refers to the minimum
non-volatile or conventional memory technologies. amount of data that are read/written in each access
Terminology and Scope: Following other re- and endurance refers to the number of writes a mem-
searchers (e.g. [121]), we use SCM to refer to byte- ory block can withstand before it becomes unreliable.
addressable non-volatile memories such as STT-RAM, The property common to all NVMs is that their write
PCM and ReRAM and use NVM to refer to all the latency/energy are significantly higher than that of
SCMs and Flash. We discuss techniques proposed read latency/energy. Also, under normal conditions,
for both single-level cell (SLC) and multi-level cell they retain data for several years without the need of
(MLC) memories. Unless otherwise mentioned, Flash any standby power. The specific properties of different
refers to the NAND Flash and not the NOR Flash, NVMs are discussed below.
and SSD refers to Flash-based SSD and not PCM-
based SSD. Also, some techniques proposed for other 2.1 Flash
memory technologies (e.g. HDDs or DRAM) may also Flash memory has three types of operations, namely
be applied to NVMs, however, we include only those read, program (write), and erase. A write operation
techniques which have been proposed in context of can only change the bits from 1 to 0, and hence, the
NVMs. Since different techniques have been evalu- only way to change a bit in a page from 0 to 1 is to
ated in different contexts, we only present their main erase the block that contains the page which sets all
idea and do not show the quantitative results. bits in the block to 1. Since erase operations are signif-
A few previous papers focus on the device-level icantly slower than the write operations, Flash SSDs
properties of Flash memory [97] or review use of use Flash translation layer (FTL) [46] which helps
SCMs for cache and main memory [83, 86]. By com- in ‘hiding’ the erase operations and thus exposing
parison, this paper surveys system- and software-level only read/write operations to the upper layers. FTL
techniques proposed for all NVMs for addressing maintains a mapping table of virtual addresses from
several important aspects, such as design of persistent upper layers to physical addresses on the Flash and
memory systems, lifetime enhancement, reliability, uses this to perform wear-leveling.
cost efficiency, energy efficiency, design of hybrid The pages in Flash are classified as valid, invalid
memory systems etc. We classify the techniques based (containing dead data) and free (available for storing
on several key features/characteristics to highlight new data). To hide the latency of erase operations,
their similarities and differences. By providing a syn- FTL performs out-of-place writes whereby it writes
thetic overview of existing frontiers of NVM manage- to a free page and invalidates the previous location of
ment techniques, we aim to provide clear directions the page and udpates the mapping. Read and write
for future research in this area. This survey paper is operations are done at page granularity, while erase
expected to be useful for researchers, OS designers, operation is done at block granularity. Typically, the
computer architects and others. size of a page and a block are 4KB and 256KB, re-
The rest of the paper is organized as follows. Section spectively. FTLs also perform garbage collection (GC),
2 presents a background on the characteristics and whereby invalid pages are reclaimed by erasing the
limitations of NVMs. Section 3 classifies the research blocks and relocating any valid pages within them to
projects on NVMs based on several parameters and new locations (if required). Clearly, given the crucial
then discusses a few of them. Section 4 discusses re- impact of FTL on the performance of Flash SSDs and
search projects which study integration or comparison presence of large number of factors involved in its
of multiple memory technologies. Finally, Section 5 design (e.g. GC policies, page v/s block mapping, size
presents the conclusion and future challenges. and storage location of its mapping table etc.), FTL
requires discussion of its own and hence, we refer the
reader to previous works for more details [23, 46].
2 A B RIEF OVERVIEW OF M EMORY T ECH - The cell-size of Flash is 4-6F 2 , while that of SRAM
NOLOGIES is 120-200F 2 [86]. Clearly, due to its high density and
latency and low write endurance, Flash is generally
In this section, we briefly summarize the properties suitable for use as a storage device or a caching layer
and challenges of different memory technologies. For between DRAM and HDD. Flash is a mature NVM
sake of comparison we also discuss conventional technology and is being developed and deployed by
memory technologies such as HDD (hard disk drive) several commercial manufacturers [2].
and DRAM. Table 1 presents the device level prop- Compared to the rotating media (viz. HDD), Flash
erties of different memory technologies. Note that is based on semiconductor chips which leads to com-
these values should be taken as representatives only pact size, low power consumption and better perfor-
since ongoing research may lead to changes in these mance for random data accesses. Also, SSDs have no
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 3
TABLE 1: Approximate device-level properties of memory technologies (lat. = latency) [30, 36, 86]
Cell size Access Granularity Read Lat. Write Lat. Erase Lat. Endurance Standby Power
HDD N/A 512B 5ms 5ms N/A > 1015 1W
SLC Flash 4-6F 2 4KB 25µs 500µs 2ms 104 − 105 0
DRAM 6-10F 2 64B 50ns 50ns N/A > 1015 Refresh power
PCM 4-12F 2 64B 50ns 500ns N/A 108 − 109 0
STT-RAM 6-50F 2 64B 10ns 50ns N/A > 1015 0
ReRAM 4-10F 2 64B 10ns 50ns N/A 1011 0
moving parts, no mechanical wearout, and are resis- modeling of operating system (OS) operations, which
tant to heat and shock. However, the relative perfor- is not provided by the typical user-space simulators.
mance advantage of SSD over HDD highly depends The simulators may also use simplistic models and
on the workload characteristics. For example, it has thus, miss crucial real-world details. On the other
been shown that for many write-intensive scientific hand, real hardware platforms do not provide ex-
workloads, SSDs may provide only marginal gain perimentation flexibility like that of a simulator and
over HDDs [70]. Also, due to its higher-cost, SSD can- given the emerging nature of SCMs, real hardware
not completely replace HDD [90]. Thus, the research platforms with SCM-based storage are also generally
challenges for Flash which need to be addressed at unavailable. This presents a challenge in the study of
system level include lifetime enhancement by mini- NVMs and makes the choice of a suitable evaluation
mizing the number of write/erase operations, wear- platform crucially important. For this reason, Table
leveling, managing faulty blocks, and performance 2 further classifies the techniques based on whether
improvement by retention relaxation and design of they have been evaluated using a simulator or real
hybrid memory systems. hardware (e.g. a CPU or an FPGA) to help the readers
gain insight.
Table 2 also classifies the techniques based on their
2.2 PCM, STT-RAM and ReRAM
optimization objectives and the essential approach.
The most important feature of these three mem- We now discuss some of these techniques in bottom-
ory technologies (referred to as SCMs) which distin- to-top order of abstraction levels across the software
guishes them from Flash, is that that they are byte- stack, beginning with architecture, firmware, middle-
addressable. Computer systems have traditionally ware (I/O, operating system etc.) up to programming
used DRAM as a volatile memory and HDD and Flash models/APIs (application programming interfaces),
as persistent storage. The difference in their latencies although note that several of these techniques span
has, however, led to large differences in their inter- across these ‘boundaries’.
faces [19, 82]. These SCMs offer the promise of storage
capacity and endurance similar to or better than Flash
3.1 Write/erase overhead minimization
while providing latencies comparable to DRAM. For
these reasons, the SCMs hold the promise of being NVMs in general have low write-endurance and due
used as universal memory technologies. Although, to write-variation introduced by workloads, a few
compared to Flash and DRAM, the SCMs are less blocks may receive much higher number of writes
mature, yet significant amount of research has been than the remaining blocks. This issue can be ad-
done in recent years for developing and utilizing them dressed by both minimizing the writes/erasures and
[86], for example, a 16Gb ReRAM prototype has been uniformly distributing them over all the blocks (called
recently demonstrated [38] which features an 8-bank wear-leveling). We now discuss the techniques based
concurrent DRAM-like core architecture and 1GB/s on these approaches.
DDR interface. The system-level techniques proposed Huang et al. [48] propose a technique to reduce
for these memories address several key research chal- write-traffic to SSDs, which also increases their life-
lenges such as integrating them in memory/storage time. Their technique employs delta-encoding for se-
hierarchy to design persistent memory systems and mantic blocks and deduplication to data blocks. For
complement conventional memory technologies, and every block write request, it decides whether it is se-
enhancing lifetime, reliability and performance etc. mantic or data block. For data block write, it computes
Sections 3 and 4 discuss the techniques proposed MD5 digest to determine whether it is a duplicate
for addressing these issues. block write. For duplicate blocks, their technique re-
turns existing block number in the found hash entry
and thus avoids the need of allocating hash table
3 NVM M ANAGEMENT T ECHNIQUES memory. If it is a semantic block (which include super-
Table 2 classifies the research projects based on the blocks, group descriptors, data block bitmap etc.),
NVM used and the level of memory hierarchy at their technique calculates the content delta relative
which the NVM is used. Many studies involving to its original content, and then appends the delta
secondary storage require long execution time and/or to a delta-logging region. Since semantic blocks are
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 4
TABLE 2: A classification of techniques are written in memory. Also, their technique does
Classification References
not harm the error performance of the application.
Non-volatile memory used
Flash [6, 8, 11, 13–19, 22–28, 31, 34, 35, 40, 44–
Further, in a 4-level NOR Flash memory, ‘11’ state is
48, 51, 53, 54, 58–60, 62–64, 68–71, 73–75, called erase state since it does not require program-
77, 78, 80, 81, 89–92, 94–96, 98, 101, 102, ming the memory cell while the remaining states are
104–106, 108, 110, 112–118, 122–127, 129–
131, 133–135, 137, 139]
called programmed state. By reducing the ‘01’ and
PCM [7, 9, 10, 12, 19–21, 29, 30, 32, 33, 36, 37, ‘10’ patterns, their technique reduces the number of
39–42, 49, 50, 52, 55, 56, 60, 67, 72, 76, 79, programming cycles which improves the lifetime of
80, 82, 87, 88, 99, 102, 103, 105, 107, 109, Flash memory.
111, 112, 117, 119, 121, 132, 136]
STT-RAM [12, 19, 32, 61, 72, 82, 88, 99, 107, 121, 138] Grupp et al. [44] use WOM (write-once memory)
ReRAM [12, 21, 37, 82, 88, 119, 121, 132] coding scheme to increase the lifetime of Flash mem-
Level in memory hierarchy where NVM is used ory and also save energy. By using extra bits, WOM
Secondary storage [6–9, 13, 15, 17, 18, 23–26, 35, 40, 42, 45–49,
(and its cache etc.) 51, 52, 54, 58–60, 62–64, 68–71, 73–77, 90, code allows multiple logical value to be written even
92, 96, 98, 101, 105, 106, 108, 112, 113, 115– if physical bits can transition only once. By virtue
117, 123–127, 129, 131, 134, 135, 137, 139] of this, WOM code allows writing data to a block
Main memory [10, 29, 30, 32, 33, 41, 50, 52, 56, 61, 72,
76, 87–89, 102, 103, 105, 107, 109, 111, 114,
twice before erasing it, which reduces the number of
121, 136, 138] erasures required.
On-chip cache [107, 138] In embedded systems, Flash can be used as main
Evaluation platform
Simulator [6, 9, 10, 13, 17, 18, 22, 23, 28, 29, 31–33,
memory, however, its large data access granularity
35–37, 42, 46, 47, 50, 51, 55, 58–60, 62, 68, (e.g. 4KB v/s 64B used in cache) and small endurance
70–72, 75, 76, 78, 82, 87, 91, 92, 94, 101– present challenges. Shi et al. [114] present a technique
105, 107–109, 111–117, 123–127, 129, 131,
134, 138, 139]
to reduce the number of writes on Flash main mem-
Real hardware [11, 12, 14, 15, 19–21, 25, 26, 31–34, 40, 41, ory to improve its lifetime and also bridge the gap
44, 48, 52, 56, 57, 59–61, 67, 70, 74, 82, 88, between access granularities. They observe that due to
89, 96, 98, 99, 104, 118, 119, 121, 122, 125, data locality, the data accessed in consecutive writes
132, 133, 135, 137, 139]
Study/optimization approach/objective come from a limited number of pages. Based on this,
Performance [6–8, 11–13, 19–21, 25, 26, 31–33, 35, 36, they use victim cache, along with a write-buffer to
improvement 40, 44, 46–48, 52, 54, 56–64, 67–70, 74, 75, perform write-coalescing, since for multiple last level
78, 79, 90, 92, 94–96, 98, 101–103, 105, 108,
110, 112, 113, 115–119, 121, 122, 126, 129, cache write-back operations, only a single (or few)
131, 133–135, 137–139] write takes place to Flash main memory.
Energy efficiency [10, 31, 39, 40, 44, 58, 72, 90, 94, 95, 102, Qureshi et al. [103] note that only the SET operation
103, 115, 117, 138]
Lifetime [6, 11, 14, 15, 17, 18, 23, 24, 28, 29, 39, 41, in PCM writes is slow while the RESET operation is
improvement 44, 48–52, 54, 58, 62, 69, 71, 73, 74, 77, 79, nearly as fast as the reads. Hence, by proactively per-
80, 88, 94, 102, 105, 109, 111–114, 116, 117, forming SET for all bits much before the anticipated
123–126, 134–136]
Wear-leveling [6, 7, 14, 22–24, 49, 51, 58, 73, 79, 87, 102,
write, the time consumed in write can be reduced
109, 117, 123–125, 127] which also leads to saving of energy. Based on this,
Write/erase [9, 10, 39, 48, 58, 59, 71, 74, 87, 101, 102, their technique initiates SET operation as soon as the
overhead 105, 112, 114, 116, 126, 134, 135, 138]
minimization
line becomes dirty in the cache.
Salvaging faulty [10, 17, 29, 41, 50, 80, 92, 105, 111, 123, 136]
blocks and
scrubbing 3.2 Wear-leveling
NVM retention re- [76, 78, 92, 113, 131]
laxation Chang et al. [24] propose a static wear-leveling al-
Persistency and [21, 32, 37, 55, 61, 76, 89, 99, 104, 107, 108, gorithm. In their algorithm, the blocks are divided
consistency 121, 121, 128, 138] into sets and each set is associated to a bit in the
Checkpointing, re- [9, 10, 12–18, 33, 36, 41, 47, 50, 56, 58, 75,
liability and error- 77, 78, 94, 104, 105, 107–109, 111, 113, 127,
Block Erasing Table (BET). Initially, all bits in the
correction 139] BET are ’0’. If a member of a set is erased within
Data-value depen- [80, 94] the interval, its associated bit is transitioned to 1.
dent optimization
Cost efficiency [8, 60, 62, 90, 98, 101, 106]
The total number of erasures in any interval is also
recorded. If the ratio of the number of erasures over
visited much more frequently than data blocks, with the number of 1s in the BET reaches a predefined
each update bringing very minimal changes; their threshold, a set whose corresponding bit is still 0 is
technique reduces the number of writes. randomly selected. Afterwards, all valid data in this
Papirla et al. [94] note that the write latency and set are moved to a free block set, and the former set
energy in a Flash memory is data-dependent, for ex- is erased for future use.
ample, in a 4-level Flash, writing ‘01’ and ‘10’ patterns Wang et al. [125] observe that since OS has the
incurs higher latency and energy than writing ‘00’ and knowledge about files at a higher level of abstraction
‘11’ patterns. They propose a data-encoding technique (e.g. the file type data belong to, the applications that
to reduce the number of ‘01’ and ‘10’ patterns that are using them etc.), this information can be used to
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 5
provide hints to lower-level FTL. The FTL uses these significantly enhanced by tolerating the failure of a
hints to deduce the update frequency and recency few blocks and/or salvaging faulty blocks. Several
of files and thus performs better wear-leveling, since techniques have been proposed for this and we now
block allocation is done in a manner that young blocks discuss a few of them.
are allocated to hot data and old blocks are allocated Liu et al. [77] present fault-tolerance based tech-
to cold data. niques to improve the lifetime of Flash-based SSD
Chang et al. [23] present a wear-leveling algorithm caches. Since the erroneous data in write-through SSD
which aims to reduce the wear of elder (i.e. a block caches can be recovered by accessing the HDD, one
with high erase count) blocks. Their technique tracks of their technique converts the uncorrectable errors
the erase recency of blocks and whenever the erase into cache misses that bring in valid data from HDD.
recency of an elder block becomes higher than the Another technique utilizes a fraction of SSD cache
average by a pre-determined threshold, the logical capacity to increase the ECC (Error-correcting code)
blocks with a low update recency are remapped to strength when Flash reaches wearout thresholds, thus
these elder blocks. The algorithm dynamically tunes the SSD cache continues operating with reduced ca-
the threshold to strike a balance between the benefit pacity.
and cost of wear-leveling. Cai et al. [17] present a technique to improve the
Wang et al. [124] propose observation wear-leveling Flash lifetime by reducing the raw bit error rate, even
(OWL), which aims to proactively avoid the uneven- when Flash memory has endured high P/E cycles
ness of erasures through monitoring temporal locality far beyond its nominal endurance. Their technique
and block utilization. Their technique uses a table periodically reads each page in Flash memory, corrects
named the Block Access Table (BAT), in which a block its errors using simple ECC, and either remaps the
is a logical block (and not a Flash block). The BAT page to a different location or re-programs it in its
stores access frequencies of logical blocks that have original location, before the page accumulates more
been recently rewritten. Using BAT, the OWL algo- errors than can be corrected with simple ECC. Thus,
rithm ranks data of logical blocks, and allocates Flash the lifetime of Flash memory is improved by address-
blocks accordingly. The ranking is used to predict a ing retention errors which form the most dominant
logical block’s relative access frequency in the near type of errors in Flash memory [15, 16].
future. In this way, OWL puts data into suitable blocks Wang et al. [123] propose a method to salvage bad
in a proactive way to achieve wear-leveling. blocks in Flash to extend its lifetime. Their method
Jeong et al. [51] show that by slowly erasing a works on observation that many pages in a bad block
Flash block with lower erase voltage, the endurance may still be healthy and hence, discarding a block
of Flash can be improved. Based on this, they provi- on failure of a few pages leads to wastage and small
sion multiple write and erase modes with different lifetime. Their method combines the healthy pages of
operation voltages and speeds to extend the life- a set of bad blocks together to form a smaller set of
time of Flash memory with minimal effect on the virtually healthy blocks which can be used to store
application throughput. At software level, garbage cold data.
collector and wear-leveler are modified to utilize these Maddah et al. [80] present a physical block sparing
write/erase modes. For example, instead of using the scheme that delays the retirement of a faulty block
same endurance for all the blocks (as in a baseline when the Flash or PCM memory exhibits failure due
Flash), their wear-leveler uses the effective endurance to write-endurance limitation. On reaching its write-
(as enabled by their lifetime extension approach) to endurance limit, a PCM or Flash block shows “stuck-
evenly distribute the effective wearing among Flash at” fault, which means that it gets stuck at either 0
blocks. or 1, and can still be read but not reprogrammed
Im et al. [49] present a wear-leveling technique [16]. Thus, the occurrence of errors becomes data
for PCM-based storage. Their technique counts the dependent; an error manifests only when a different
write-counts on PCM pages and if a logical page is bit value is written to a faulty cell than what it is stuck
frequently updated, their technique allocates larger at. Based on this, on an unsuccessful write to a block,
number of physical pages to it to balance the writes their scheme does not immediately retire a block.
on all pages. Thus, the logical pages have a different Instead, a spare block is temporarily borrowed from
number of physical pages allocated to them based on the spare pool of blocks. Later on, write operation is
the update frequency. They have also shown that their again attempted on the original (faulty) block which
technique keeps the number of additional writes due is likely to succeed if the data are same as the existing
to wear-leveling small. data on the block. In such a case, the spare block
borrowed above is returned to the pool. When the
faulty block shows failures more than a threshold, it
3.3 Salvaging faulty blocks and tolerating failures is finally classified as bad and retired. In effect, for
The ‘raw lifetime’ of a memory system is decided the same lifetime improvement, their scheme reduces
by the first failure of a block. This lifetime can be the requirement of spare blocks or achieves higher
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 6
lifetime for the same spare pool capacity. physical page over two faulty PCM pages, as long
Yoon et al. [136] present a fine-grained remapping as there is no byte position that is faulty in both
technique to protect NVM against errors. Conven- pages. Using this, every byte of physical memory
tionally, when a block accumulates more wear-out can be served by at least one of the two replicas.
failures than can be corrected, it gets disabled and Since even in pages with hundreds of bit failures, the
remapped. Their technique utilizes the still-functional probability of finding such a pair of pages is high,
cells of worn-out memory blocks to store the redi- their technique leads to large improvement in lifetime
rection address using heavily redundant code. The of PCM memory over conventional error-detection
spare area is created dynamically by the OS within the techniques.
main memory. When a remapped block itself fails, it Sampson et al. [105] present techniques which
will be remapped further, which may create chained trade-off data accuracy for gaining performance, life-
remapping. To avoid this, their technique writes the time and density in NVMs for applications which can
final address to the original failed block. The benefit of tolerate data inaccuracies. Their first technique allows
their fine-grained remapping technique is that failure errors in MLC memories by reducing the number
of a 64B block does not lead to remapping or disabling of programming pulses used to write them, which
of an entire 4KB page. provides higher performance and density. The sec-
Gao et al. [41] present a hardware-software coop- ond technique uses the blocks that have exhausted
erative approach to tolerate failures in NVM main their hardware error correction resources to store ap-
memory. Their approach makes error handling trans- proximate data which increases the memory lifetime.
parent to the application by using the memory ab- Further, to reduce the effect of failed bits on overall
straction offered by garbage-collected managed lan- result, correction of higher-order bits is prioritized.
guages such as Java, C#, JavaScript etc. The runtime Several multimedia and graphics applications can
ensures that memory allocations never use the failed tolerate minor errors since these errors are not per-
lines and moves data when the memory lines fail ceived by human end-users [84, 93]. Fang et al. [39]
during program execution. Conventional hardware- leverage this property to reduce the write-traffic to
only schemes which use wear-leveling, delay a single PCM. In their technique, if the originally stored data
failure, however, their limitation is that due to wear- are very close to the new data to be written, the write
leveling, after a large number of writes, failures hap- operation is canceled and originally stored data are
pen uniformly throughout the memory which causes taken as the new data. This also reduces the energy
fragmentation. By contrast, their technique uses a consumption of the memory.
low-cost “failure clustering hardware” which logically
remaps failed lines to top or bottom edge of the region 3.4 Retention Relaxation
to maximize the contiguous space available for object Flash memory allows trading-off the retention period
allocation. On the first failure, this hardware also with write speed and P/E (program/erase) cycles. For
installs a pointer to the boundary between normal and example, the Flash memory can be programmed faster
failed lines. Thus, their technique reduces fragmenta- but with shorter retention time guarantee and thus,
tion and improves performance under failures. the write-operations can be made faster. This prop-
Zhao et al. [139] note that the raw BER of Flash erty also applies to other NVMs. Several techniques
memory varies dramatically under different P/E cy- utilize this property for optimization of NVM memory
cles with different retention time. Hence, the ECC systems. We now discuss a few of them.
used to ensure reliability over-protects the Flash Pan et al. [92] present an approach to relax the
memory, since most pages show better-than-worst- retention period of Flash for improving P/E cycling
case reliability. To utilize the residual error-correction endurance and/or programming speed. To avoid los-
strength, they propose error-prone over-clocking of ing data at small retention periods, data refresh op-
Flash memory chip I/O links which translates into erations are used. Further, to minimize the impact
higher performance, and the ECC becomes respon- of refresh operations on normal I/O requests, they
sible for the errors caused by both Flash memory propose a scheduling strategy which performs several
storage and controller-Flash data transfer. Based on optimizations, such as giving higher priority to I/O
the study of over-clocking on data read path vs. requests, preferring issue of refresh operations when
over-clocking on data write path, they show that the the SSD is idle, etc.
former is much more effective and favorable than the Shi et al. [113] present a technique to improve the
latter. This is because, the over-clocking on write-path endurance and write-performance of Flash memory.
can lead to permanent data storage errors, while the Based on the error model of Flash, their technique
impact of over-clocking on read-path can be more applies different optimizations at different stages of
easily tolerated. Flash lifetime. In the first (i.e. early lifetime) stage,
Ipek et al. [50] present a method to allow graceful the retention time and P/E cycles are traded-off
degradation of PCM capacity on occurrence of hard to optimize performance by maximizing the write-
failures. Their technique works by replicating a single speed. In the second (i.e. middle lifetime) stage, the
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 7
write operations are differentiated into hot writes degrade the performance of Flash. To address this,
and cold writes and the hot writes are speeded up Kgil et al. [58] propose splitting the Flash based disk
while the cold writes are slowed down to improve cache into separate read and write regions (e.g. 90%
performance with little effect on endurance. In the last read region and 10% write region). Read critical Flash
stage, the write speeds are slowed down to extend blocks are located in the read region that may only
the endurance. They also propose a smart refresh evict Flash blocks and pages on read misses. The
approach which refreshes only the data approaching write region captures all the writes to the Flash and
the retention time. performs out-of-place writes. Wear-leveling is applied
Liu et al. [78] analyze different datacenter work- globally to all the regions. Partitioning of Flash into
loads and observe that for most workloads, the data read/write region reduces the number of blocks that
are frequently updated and hence, they require only need to be considered for garbage collection, which
few days of retention. This is much shorter than the significantly reduces the number of read, write and
retention time typically specified for NAND Flash. erase operations.
This gap in retention time can be exploited to optimize Lee et al. [68] propose a semi-preemptive garbage
the operation of SSD. Also, with retention relaxation, collection scheme to improve the performance of
fewer retention errors need to be tolerated and hence, garbage collection in Flash memory. The moving oper-
ECC of lower complexity can be used to protect data ation of a valid page involves page read, data transfer,
which reduces the ECC overhead [85]. page write and meta data update. If the origin and
Huang et al. [47] present a technique to improve the destination blocks are in the same plane, the data-
performance of a Flash-based SSD. Their technique transfer operation can be replaced by copy-back op-
selectively replaces ECC with EDC (error detection erations. Based on this, their scheme decides possible
code) based on the consistency and reliability require- preemption points and allows preemption only on
ments of the pages. Specifically, when writing fresh those points to minimize the preemption overhead,
data which have no backup in the next storage layer, hence the name semi-preemptive. Preemption allows
their technique uses ECC, otherwise it uses EDC. servicing pending I/O requests in the queue which
Since the decoding latency of EDC is much smaller, improves response time. Also, if the incoming request
read access to EDC-protected pages is speeded up, accesses the same page in which the GC process is
which improves the performance. On data corruption, attending, it can be merged and if it accesses different
a page is accessed from lower storage layer, thus the page but is of the same type as current request (e.g.
reliability is not sacrificed. read after read), it can be pipelined. They have shown
Wu et al. [131] present a technique to reduce that their garbage collection scheme is especially use-
the response time of SSD. Their technique monitors ful for heavily bursty and write-dominant workloads.
the write-intensity of the workload, and when the Wang et al. [126] propose an I/O scheduler for
write request queue contains several overlapped write Flash-based SSDs which improves performance by
transactions, it increases the memory programming leveraging the parallelism inherent in SSDs. The
speed at the cost of shorter data retention time. Later scheduler speculatively divides the whole SSD space
on when the write intensity again becomes low, the into many subregions and associates each subregion
short-lifetime data are rewritten to ensure data in- with a dedicated dispatching subqueue. Incoming
tegrity. They also develop a scheduling solution to requests are placed into a subqueue corresponding
implement their write strategy. to their accessing addresses. This facilitates simulta-
neous execution of the requests leading to enhanced
3.5 Flash I/O scheduling parallelism. Their scheduler also sorts the pending
Wu et al. [129] note that in Flash memory, once a requests in the same subqueue to create sequentiality
program/erase operation is issued to the Flash chip, and reduce the harmful effect of random writes on
the subsequent read operations have to wait till the performance and lifetime.
slow program/erase operation is completed. They
propose different strategies to suspend the on-going 3.6 Programming models and APIs for persistent
program and erase operations to service pending memory system design
reads and resume the suspended operation later on. The non-volatility property of NVMs enables design
During a program operation, the page buffer contains of persistent memory systems which can survive
the data to be written, which may be lost when a program and system failures [66]. Several techniques
read request arrives. To address this, they provision a have been proposed to implement and optimize it. We
shadow buffer where the contents of page buffer are now discuss a few of them.
stored during suspension. After completion of read Volos et al. [121] present Mnemosyne, an interface
operation, it reloads the page buffer with the original for programming with SCMs. Mnemosyne enables
data. applications to declare static variables with values that
In Flash based disk cache, out-of-place writes in- persist across system restarts. It also allows the appli-
crease the overhead of garbage collection and also cation to allocate memory from the heap, which is
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 8
backed by persistent memory. The ordering of writes cles. Their technique continually monitors the voltage
to persistent memory is controlled by software using of the energy source and when it drops below a
a combination of non-cached write modes, cache line threshold, the volatile stage of the program is written
flush instructions, and memory barriers. Further, it to NVM, where it survives a power loss. At next-boot,
provides primitives for directly modifying persistent a checkpoint-restoration routine copies the state back
variables and supports consistent updates through a to volatile memory for resuming execution.
lightweight transaction mechanism. Dong et al. [36] propose use of PCM to improve
Coburn et al. [32] present NV-heaps, a persistent checkpointing performance. Conventionally, HDD is
object system for providing transactional semantics used for making checkpoints, however, its low band-
and a persistence model. For ensuring application- width leads to large overheads in checkpointing. Also,
consistency, NV-heaps supports the application in while DRAM is relatively fast, its high leakage and
separating volatile and nonvolatile data. As an ex- volatile nature makes it unsuitable for checkpoint-
ample, applications can ensure that no pointers ex- ing and while Flash is non-volatile, its low write-
ist from persistent memory to DRAM space, and endurance limits the checkpoint frequency. They use
thus, the consistency of persisted state is ensured a 3D PCM architecture for checkpointing which pro-
after a reboot. Also, applications can name heaps to vides high overall I/O bandwidth. Also, for a mas-
make programming with persistent media easier. Both sively parallel processing system, they propose a hy-
Mnemosyne and NV-Heaps help applications differ- brid local/global checkpointing mechanism, where in
entiate among DRAM, persistent memory (SCM), and addition to the global checkpoint, local checkpoints
Flash when allocating new pages for their heap or are also made that periodically backup the state of
stack. each node in their own private memory. The fre-
Narayanan et al. [89] present a persistence approach quency at which local/global checkpoints are made
for systems where all the system main memory is can be tuned to achieve a balance between perfor-
non-volatile. Their approach uses ’flush-on-fail’ which mance and resiliency, for example, systems with high
implies that the transient state held in CPU registers transient failures can make frequent local checkpoints
and cache lines is flushed to the NVM only on a and only few global checkpoints. Also, if on a failure,
failure (and not during program execution), using a local checkpoint itself is lost, the state of the system
small residual energy window provided by the system can be restored by the global checkpoint.
power supply. On the restart, the OS and application Condit et al. [33] present a transactional file system
states are restored like a transparent checkpoint, thus that leverages byte addressability of SCM to reduce
all the state is recovered after a failure. Thus, their the amount of metadata written during an update
approach provides suspend and resume functionality and achieves consistency through shadow updates.
and also eliminates the runtime overhead of flushing Their file system guarantees that file system writes
cache lines on each update. will become durable on persistent storage media in
Zhao et al. [138] present a persistent memory de- the time it takes to flush the cache and that each
sign that employs a non-volatile last level cache and file system operation is performed atomically and in
non-volatile main memory to construct a persistent program order. For ordering updates, they use epoch-
memory hierarchy. In their approach, when an NV barriers. A cache line is tagged with an epoch number
cache line is updated, it represents the newly updated and the cache hardware is modified to ensure that
version while the clean data stored in NV memory memory updates due to write backs always happen
represents the old version. When the dirty NV cache in epoch order.
line is evicted, the old version in NV memory will
be automatically updated. With this multiversioned 3.7 System-level redesign of NVM storage and
persistent memory hierarchy, their approach enables memory
persistent in-place updates without logging or copy- Accounting for the properties of NVMs at system-
on-write. This brings the performance of a system level helps in achieving several optimizations in the
with persistent support close to that of one without memory system/hierarchy, as shown by the following
persistent support. They also develop software inter- techniques.
face and architecture extensions to provide atomicity Jung et al. [52] propose a system architecture,
and consistency support for their persistent memory named Memorage that utilizes the persistent memory
design. resources of a system (viz. persistent main memory
Ransford et al. [104] present an approach for sup- and persistent storage device) in an integrated man-
porting execution of long-running tasks on tran- ner. It is well-known that due to over-provisioning
siently powered computers, such as RFID-scale (radio- of storage resources, its utilization remains small.
frequency identification) devices which work under To address this, Memorage collapses the traditional
very tight resource constraints. Their technique trans- static boundary between main memory and storage
forms a program into interruptible computations so resources and allows main memory to borrow di-
that they can be spread across multiple power lifecy- rectly accessible memory resources from the storage
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 9
device to cope with memory shortages. This avoids 4.1 Flash+ HDD
the need of swapping the pages. Further, since the
lifetime of storage device is much higher than that Chen et al. [26] present a technique to leverage low
of main memory, the system lifetime is limited by cost of HDD and high speed of SSD to bring the best
that of main memory. To address this, Memorage of both together. Their technique detects performance-
allows the storage device to donate its free capacity critical blocks based on workload access patterns and
to the main memory and thus, offers “virtual” over- moves only the most performance-critical blocks in
provisioning without incurring the cost for physical the SSD. Also, semantically-critical blocks (e.g. file
over-provisioning. This helps in extending the lifetime system metadata) are given priority to stay in SSD for
of main memory, at the cost of reduction in lifetime improving performance. Further, for improving the
of the storage. performance of write-intensive workloads, incoming
writes are buffered into the low-latency SSD.
Balakrishnan et al. [13] present a technique to
improve the reliability of an array of SSDs. Their Yang et al. [135] present a storage design consisting
technique works on the observation that balancing the of both HDD and SSD which aims to leverage the
number of writes in SSD arrays can lead to correlated best features of both. In their design, SSD stores read-
failures since they may use-up their erasure cycles intensive reference blocks and HDD stores the deltas
at similar rates. Thus, the array can be in a state between currently accessed blocks and the corre-
where multiple devices have reached the end of their sponding reference blocks. They also use an algorithm
erasure limits, and thus all of them need to be re- that computes deltas upon I/O writes and combines
placed by newer devices. Their technique distributes deltas with reference blocks upon I/O reads to in-
parity blocks unevenly across the array. Since the terface the OS. Thus, their approach aims to utilize
parity blocks are updated more often than data blocks fast read performance of SSD and fast sequential write
due to random access patterns, the devices holding performance of HDD, while avoiding slow SSD writes
more parity receive more writes and consequently age which improves its performance and lifetime.
faster. This creates an age differential on drives and Soundararajan et al. [116] propose a method to
reduces the probability of correlated failures. When an increase the lifetime of SSDs by using hard disk
oldest device is replaced by a new one, their technique drive (HDD) as a persistent write cache for an MLC-
reshuffles the parity distribution. based SSD. Their technique appends all the writes to
a log stored on the HDD and eventually migrates
them to SSD, preferably before subsequent reads.
They observe that HDDs can match the sequential
4 R ESEARCH P ROJECTS ON C OMBIN - write bandwidth of mid-range SSDs. Also, typical
ING OR C OMPARING M ULTIPLE M EMORY workloads contain a significant fraction of block over-
T ECHNOLOGIES writes and hence, by maintaining a log-structured
HDD, the HDD can be operated at its fast sequential
Since each memory technology has its own advan- write mode. At the same time, the write-coalescing
tages and disadvantages, several researchers propose performed by the HDD increases the sequentiality of
combining multiple technologies to bring together the the workload as observed by the SSD and also reduces
best of them, while others present a comparison study writes to it, which improves its lifetime.
to provide insights into the tradeoffs involved. Table 3 For virtual memory management, Liu et al. [74]
summarizes these techniques. We now discuss a few propose integrating HDD with SSD to overcome the
of them. limited endurance issue of Flash, while also bounding
performance loss incurred due to swapping to fulfill
TABLE 3: Classification of techniques utilizing or QoS requirements. Their technique sequentially swaps
comparing multiple memory technologies a set of pages of virtual memory to the HDD if
Classification References
they are expected to be read together. Further, based
Use of multiple memory technologies
HDD + SSD [26, 62, 64, 74, 81, 98, 115, 116, on the page-access history, their technique creates an
133, 135] out-of-memory virtual memory page layout which
PCM+Flash+HDD [60] spans both HDD and SSD. Using this, the random
PCM+Flash [112, 117]
reads can be served by SSD and the sequential reads
DRAM+ PCM or NOR Flash [87]
DRAM+PCM [36, 102] are asynchronously served by the HDD, thus the
DRAM+SCM [12, 33] total bandwidth of the two devices can be used to
DRAM+Flash [11] accelerate page swapping, while also avoiding their
Comparison of multiple memory technologies
SSD v/s HDD [8, 25, 62, 90, 96, 106]
individual limitations.
NVMs v/s DRAM and HDD [19] Wu et al. [133] present a technique to leverage both
PCM v/s HDD [36, 60] SSD and HDD to improve performance in a hybrid
STT-RAM v/s HDD [61]
storage system. Their technique utilizes both initial
block allocation and migration to reach an equilibrium
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 10
state where the response times of different devices and PCM SSDs can provide better read latency and
equalize. Their technique keeps track of the request aggregate I/O time than a Flash only configuration.
response times of different devices (viz. SSD or HDD)
and performs allocation to prefer the faster device 4.4 DRAM+PCM
while also achieving workload-balancing on the de-
vices. Also, their technique detects whether a block Qureshi et al. [102] present a hybrid main memory
is cold/hot and cold data are migrated only in the system where DRAM is used as a “page cache” for
background. the PCM memory to combine latency advantage of
Kim et al. [62] present a hybrid HDD-SSD design DRAM with capacity advantage of PCM. They pro-
that exploits the complementary properties of these pose several policies to mitigate write-overhead to
two media to provide high performance and service PCM memory and increase its lifetime. On a page
differentiation under a given cost budget. Their tech- fault, the page fetched from hard disk is written to
nique uses statistical models for performance of SSD DRAM only. This page is written to PCM only when it
and HDD to make dynamic request partitioning de- is evicted from DRAM and it is marked as dirty. Also,
cisions. Since random writes cause fragmentation on the writes to a page are tracked at the granularity of
SSD and increase the garbage collection overhead and a cache line and only the lines modified in a page are
latency of SSD, their technique periodically migrates written back to reduce the effective number of writes
some pages from SSD to HDD so that portions of to main memory. Further, to achieve wear-leveling,
writes can be redirected to the HDD and the ran- the lines in each page are stored in the PCM in a
domness can be reduced. They also develop a model rotated manner.
to find the most economical HDD/SSD configuration
for given workloads using Mixed Integer Linear Pro- 4.5 DRAM+SCM
gramming (ILP). Bailey et al. [12] present a persistent, versioned, key-
value store with a standard get/put interface for
4.2 Flash+PCM applications. Their approach maintains the permanent
Sun et al. [117] propose using PCM as the log region data in SCM and uses DRAM as a thin layer on
of the NAND Flash memory storage system. PCM top of SCM to address its performance and wear-
log region supports in-place updating and avoids the out limitations. The threads maintain and manipulate
need of out-of-date log records. Their approach re- local, volatile data in DRAM and only committed,
duces both read and erase operations to Flash, which persistent state is written to SCM, thus avoiding the
improve the lifetime, performance and energy effi- slow-write bottleneck of SCMs. The byte-addressable
ciency of Flash storage. Due to the byte-addressable nature of SCM is used for fine-grained transactions
nature of PCM, the performance of read-operations in non-volatile memory, and snapshot isolation is
is also improved. They also propose techniques to used for supporting concurrency, consistency, recov-
ensure that the PCM log region does not not wear erability and versioning, for example, on a power-loss
out before the Flash memory. failure, the data previously committed to SCM can be
accessed on resumption and will be in a consistent
4.3 PCM+Flash+HDD state.
Payer et al. [98] present a technique to leverage
Kim et al. [60] evaluate the potential of PCM in stor- both low-cost, high-capacity HDD and high-cost low-
age hierarchy, considering its cost and performance. capacity SSD at same level in memory hierarchy. Their
They observe that although based on the material-level technique allows using either a low-performance SSD
characteristics, write to Flash is slower than that to or a high-performance SSD. For the case of a low-
PCM, based on system-level characteristics, writes to performance SSD, executable files and program li-
a Flash-based SSD can be faster than that of a PCM- braries are moved to the SSD and the remaining
based SSD, due to reasons such as power constraints files are moved to HDD; also randomly accessed
etc. Based on this insight, they study two storage use- files are moved to SSD and the files with mixed- or
cases: tiering and caching. For tiering, they model a contiguous-access pattern are moved to HDD. When a
storage system consisting of Flash, HDD, and PCM high-performance SSD is used which has throughput
to identify the combinations of device types that offer nearly equal to that of HDD, the most frequently used
the best performance within cost constraints. They files are moved are SSD and the remaining files are
observe that PCM can improve the performance of a moved to HDD.
tiered storage system and certain combinations (e.g.
30% PCM + 67% Flash + 3% HDD combination)
can provide higher performance per dollar than a 4.6 SSD v/s HDD
combination without PCM. For caching, they compare Narayanan et al. [90] compare the performance, en-
aggregate I/O time and read latency of PCM with that ergy consumption and cost of hard-disks with that
of Flash. They observe that a combination of Flash of Flash-based SSDs in year 2009. For a range of
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTING SYSTEMS 11
data-center workloads, they analyze both complete [85] or change of value stored in MLC PCM cell over
replacement of disks by SSDs, and use of SSDs as time due to resistance drift [10]. In near future, along
an intermediate tier between disks and DRAM. They with performance and energy, researchers need to also
observe that due to low capacity per dollar of SSDs, look into issues such as soft-error resilience of NVMs.
replacing disks by SSDs is not an optimal solution for At architecture level, synergistic integration of
their workloads. They find that the capacity/dollar of techniques for write-minimization, wear-leveling and
SSDs need to be improved by a factor of 3 to 3000 to fault-tolerance will help in achieving magnitude order
make them comparable with disks. improvement in lifetime of NVM memory systems. At
Albrecht et al. [8] performed similar study in year system-level, unification of memory/storage manage-
2013 and found that due to declining price of Flash ment in a single address space via hardware/software
along with relatively steady performance of disk, the cooperation can further minimize the overheads of the
break-even point has been shifting and Flash is now two-level storage model. Since computing systems of
becoming economical for a larger range of workloads. all sizes and shapes, ranging from milli-watt handheld
systems to megawatt data centers and supercomput-
4.7 NVMs v/s DRAM and HDD ers depend on efficient memory systems and present
different constraints and optimization objectives, ac-
Caulfield et al. [19] compare several memory tech-
counting for their unique features will be extremely
nologies, such as DRAM, NVMs and HDD. They
important to design techniques optimized for dif-
measure the impact of these technologies on I/O-
ferent platforms and usage scenarios. Finally, these
intensive, database, and memory-intensive applica-
technologies also need to be highly cost-competitive
tions which have varying latencies, bandwidth re-
to justify their use in commodity market.
quirements and access patterns. They also study the
In this paper, we presented a survey of techniques
effect of different options for connecting memory
which utilize NVMs for main memory and storage
technologies to the host system. They observe that
systems. We also discussed techniques which combine
NVMs provide large gains in both application-level
or compare multiple memory technologies to study
and raw I/O performance. For some applications,
their relative merits and bring the best of them to-
PCM and STT-RAM can provide a magnitude-order
gether. We classified these techniques on several key
improvement in performance over HDDs.
parameters to highlight their similarities and differ-
ences and identify major research trends. It is hoped
5 F UTURE C HALLENGES AND C ONCLUSION that this survey will inspire novel ideas for fully
In an era of data explosion, the storage and processing leveraging the potential of NVMs in future computing
demands on memory systems are increasing tremen- systems.
dously. While it is clear that conventional memory
technologies fall far short of the demands of future R EFERENCES
computing systems, NVM technologies, as they stand
[1] “Meet Gordon, the World’s First Flash Supercomputer,”
today, are also unable to meet the performance, energy www.wired.com/2011/12/gordon-supercomputer/, 2011.
efficiency and reliability targets for future systems. [2] “Flash Drives Replace Disks at Amazon, Facebook, Drop-
We believe that meeting these challenges will require box,” www.wired.com/2012/06/flash-data-centers/, 2012.
[3] http://goo.gl/qwyFHe, 2013.
effective management of NVMs at multiple layers of [4] “Oak Ridge to acquire next generation supercomputer,” http:
abstraction, ranging from device level to system level. //goo.gl/d315UD, 2014.
At device level, for example, 3D design can lead [5] www.youtube.com/yt/press/statistics.html, 2015.
[6] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. S.
to denser form factor, smaller footprint and lower Manasse, and R. Panigrahy, “Design Tradeoffs for SSD Per-
latencies [100]. Also, fabrication of higher capacity formance,” USENIX Annual Technical Conf., pp. 57–70, 2008.
SCM prototypes and their mass production will fuel [7] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and
S. Swanson, “Onyx: a protoype phase change memory stor-
further research into them. Although MLC NVM age array,” USENIX HotStorage, 2011.
provides higher density and lower price than SLC [8] C. Albrecht, A. Merchant, M. Stokely, M. Waliji, F. Labelle,
NVM, its endurance is generally two-three orders of N. Coehlo, X. Shi, and E. Schrock, “Janus: Optimal flash pro-
visioning for cloud storage workloads.” in USENIX Annual
magnitude lower than that of SLC NVM, for example, Technical Conference, 2013, pp. 91–102.
the endurance of a 70 nm SLC Flash is around 100K [9] F. A. Aouda, K. Marquet, and G. Salagnac, “Incremental
cycles, that of a 2-bit 2x nm MLC Flash is around 3K checkpointing of program state to NVRAM for transiently-
powered systems,” in Int. Symp. on Reconfigurable and
cycles while for 3-bit MCL Flash, this value is only a Communication-Centric Systems-on-Chip, 2014, pp. 1–4.
Since the increasing demand for memory capacity may necessitate the use of MLC NVM, effective software schemes are required for mitigating the NVM write overhead.
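As a concrete illustration of one such software-level scheme, the sketch below implements a simple data-comparison write in C: the new data are compared against the current contents and only the words that actually differ are written back, since redundant writes consume endurance without changing state. This is a minimal, self-contained toy that models the NVM region as an in-memory array and is not taken from any of the surveyed papers; a real implementation would issue the filtered writes through an NVM programming interface and would typically operate at cache-line or page granularity.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy model: this array stands in for an NVM region, and the write
     * counter is a proxy for cell wear.  In a real system the writes
     * would go through an NVM programming interface, not a DRAM array. */
    #define NVM_WORDS 8
    static uint64_t nvm[NVM_WORDS];
    static unsigned long words_written;

    /* Data-comparison write: reads are cheap relative to NVM writes,
     * so update only the words whose value has actually changed. */
    static void nvm_write_filtered(const uint64_t new_data[NVM_WORDS])
    {
        for (size_t i = 0; i < NVM_WORDS; i++) {
            if (nvm[i] != new_data[i]) {
                nvm[i] = new_data[i];
                words_written++;
            }
        }
    }

    int main(void)
    {
        uint64_t update[NVM_WORDS];

        memcpy(update, nvm, sizeof(update));   /* start from current contents */
        update[3] = 0xdeadbeefULL;             /* only one word really changes */
        nvm_write_filtered(update);
        printf("words written: %lu of %d\n", words_written, NVM_WORDS);
        return 0;
    }

Production schemes apply the same accounting at page or cache-line granularity and combine it with wear leveling and write buffering, but the basic trade of an extra read for fewer writes carries over.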
Further, NVMs are generally considered immune to radiation-induced soft errors; however, soft errors may appear in NVMs from stochastic bit-inversions due to thermal noise [...] different platforms and usage scenarios. Finally, these technologies also need to be highly cost-competitive to justify their use in the commodity market.

In this paper, we presented a survey of techniques which utilize NVMs for main memory and storage systems. We also discussed techniques which combine or compare multiple memory technologies to study their relative merits and bring the best of them together. We classified these techniques along several key parameters to highlight their similarities and differences and to identify major research trends. It is hoped that this survey will inspire novel ideas for fully leveraging the potential of NVMs in future computing systems.

REFERENCES

[1] "Meet Gordon, the World's First Flash Supercomputer," www.wired.com/2011/12/gordon-supercomputer/, 2011.
[2] "Flash Drives Replace Disks at Amazon, Facebook, Dropbox," www.wired.com/2012/06/flash-data-centers/, 2012.
[3] http://goo.gl/qwyFHe, 2013.
[4] "Oak Ridge to acquire next generation supercomputer," http://goo.gl/d315UD, 2014.
[5] www.youtube.com/yt/press/statistics.html, 2015.
[6] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. S. Manasse, and R. Panigrahy, "Design Tradeoffs for SSD Performance," USENIX Annual Technical Conf., pp. 57–70, 2008.
[7] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and S. Swanson, "Onyx: a prototype phase change memory storage array," USENIX HotStorage, 2011.
[8] C. Albrecht, A. Merchant, M. Stokely, M. Waliji, F. Labelle, N. Coehlo, X. Shi, and E. Schrock, "Janus: Optimal flash provisioning for cloud storage workloads," in USENIX Annual Technical Conference, 2013, pp. 91–102.
[9] F. A. Aouda, K. Marquet, and G. Salagnac, "Incremental checkpointing of program state to NVRAM for transiently-powered systems," in Int. Symp. on Reconfigurable and Communication-Centric Systems-on-Chip, 2014, pp. 1–4.
[10] M. Awasthi, M. Shevgoor, K. Sudan, B. Rajendran, R. Balasubramonian, and V. Srinivasan, "Efficient scrub mechanisms for error-prone emerging memories," in International Symposium on High Performance Computer Architecture, 2012, pp. 1–12.
[11] A. Badam and V. S. Pai, "SSDAlloc: hybrid SSD/RAM memory management made easy," in USENIX conference on Networked systems design and implementation, 2011, pp. 16–16.
[12] K. A. Bailey, P. Hornyack, L. Ceze, S. D. Gribble, and H. M. Levy, "Exploring storage class memory with key value stores," in Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads (INFLOW). ACM, 2013, p. 4.
[13] M. Balakrishnan, A. Kadav, V. Prabhakaran, and D. Malkhi, "Differential RAID: rethinking RAID for SSD reliability," ACM Transactions on Storage (TOS), vol. 6, no. 2, p. 4, 2010.
[14] S. Boboila and P. Desnoyers, "Write endurance in flash drives: Measurements and analysis," in USENIX Conference on File and Storage Technologies (FAST), vol. 10, 2010.
[15] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, "Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis," in Design, Automation & Test in Europe, 2012, pp. 521–526.
[16] Y. Cai, O. Mutlu, E. F. Haratsch, and K. Mai, "Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation," in International Conference on Computer Design (ICCD), 2013, pp. 123–130.
[17] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. S. Unsal, and K. Mai, "Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime," in International Conference on Computer Design, 2012, pp. 94–101.
[18] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai, "Neighbor-cell assisted error correction for MLC NAND flash memories," in International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2014, pp. 491–504.
[19] A. M. Caulfield, J. Coburn, T. Mollov, A. De, A. Akel, J. He, A. Jagatheesan, R. K. Gupta, A. Snavely, and S. Swanson, "Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing," in International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, pp. 1–11.
[20] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson, "Moneta: A high-performance storage array architecture for next-generation, non-volatile memories," in Int. Symp. on Microarchitecture, 2010, pp. 385–395.
[21] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, "Atlas: leveraging locks for non-volatile memory consistency," Int. Conf. on Object Oriented Programming Systems Languages & Applications, pp. 433–452, 2014.
[22] L.-P. Chang, "On efficient wear leveling for large-scale flash-memory storage systems," in ACM Symposium on Applied Computing (SAC), 2007, pp. 1126–1130.
[23] L.-P. Chang and L.-C. Huang, "A low-cost wear-leveling algorithm for block-mapping solid-state disks," SIGPLAN Not., vol. 46, no. 5, pp. 31–40, 2011.
[24] Y.-H. Chang, J.-W. Hsieh, and T.-W. Kuo, "Endurance enhancement of flash-memory storage systems: an efficient static wear leveling design," in DAC, 2007, pp. 212–217.
[25] F. Chen, D. A. Koufaty, and X. Zhang, "Understanding intrinsic characteristics and system implications of flash memory based solid state drives," in ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 1, 2009, pp. 181–192.
[26] F. Chen, D. A. Koufaty, and X. Zhang, "Hystor: making the best use of solid state drives in high performance storage systems," in Int. Conf. on Supercomputing, 2011, pp. 22–32.
[27] F. Chen, R. Lee, and X. Zhang, "Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing," in International Symposium on High Performance Computer Architecture, 2011, pp. 266–277.
[28] F. Chen, T. Luo, and X. Zhang, "CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives," in USENIX Conference on File and Storage Technologies (FAST), vol. 11, 2011.
[29] J. Chen, G. Venkataramani, and H. H. Huang, "RePRAM: Re-cycling PRAM faulty blocks for extended lifetime," in Int. Conf. on Dependable Systems and Networks, 2012, pp. 1–12.
[30] S. Chen, P. B. Gibbons, and S. Nath, "Rethinking database algorithms for phase change memory," CIDR, pp. 21–31, 2011.
[31] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, and G. R. Ganger, "Active disk meets flash: a case for intelligent SSDs," in International Conference on Supercomputing, 2013, pp. 91–102.
[32] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson, "NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories," in ACM SIGARCH Computer Architecture News, vol. 39, no. 1, 2011, pp. 105–118.
[33] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," in ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009, pp. 133–146.
[34] B. Cully, J. Wires, D. Meyer, K. Jamieson, K. Fraser, T. Deegan, D. Stodden, G. Lefebvre, D. Ferstay, and A. Warfield, "Strata: Scalable high-performance storage on virtualized non-volatile memory," in USENIX Conference on File and Storage Technologies (FAST), 2014, pp. 17–31.
[35] C. Dirik and B. Jacob, "The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization," in ACM SIGARCH Computer Architecture News, vol. 37, no. 3, 2009, pp. 279–289.
[36] X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, and Y. Xie, "Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems," in High Performance Computing, Networking, Storage and Analysis, 2009.
[37] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson, "System software for persistent memory," in European Conference on Computer Systems (EuroSys), 2014.
[38] R. Fackenthal et al., "A 16Gb ReRAM with 200MB/s write and 1GB/s read in 27nm technology," in IEEE International Solid-State Circuits Conference, 2014, pp. 338–339.
[39] Y. Fang, H. Li, and X. Li, "SoftPCM: Enhancing Energy Efficiency and Lifetime of Phase Change Memory in Video Applications via Approximate Write," in IEEE Asian Test Symposium (ATS), 2012, pp. 131–136.
[40] M. Gamell, I. Rodero, M. Parashar, and S. Poole, "Exploring energy and performance behaviors of data-intensive scientific workflows on systems with deep memory hierarchies," in Int. Conf. on High Performance Computing, 2013, pp. 226–235.
[41] T. Gao, K. Strauss, S. M. Blackburn, K. S. McKinley, D. Burger, and J. Larus, "Using managed runtime systems to tolerate holes in wearable memories," in Conf. on Programming Language Design and Implementation (PLDI), 2013, pp. 297–308.
[42] E. Giles, K. Doshi, and P. Varman, "Bridging the programming gap between persistent and volatile memory using WrAP," in Int. Conf. on Computing Frontiers, 2013, p. 30.
[43] B. Giridhar, M. Cieslak, D. Duggal, R. Dreslinski, H. M. Chen, R. Patti, B. Hold, C. Chakrabarti, T. Mudge, and D. Blaauw, "Exploring DRAM organizations for energy-efficient and resilient exascale memories," in International Conference for High Performance Computing, Networking, Storage and Analysis, 2013.
[44] L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. H. Siegel, and J. K. Wolf, "Characterizing flash memory: anomalies, observations, and applications," in Int. Symp. on Microarchitecture, 2009, pp. 24–33.
[45] L. M. Grupp, J. D. Davis, and S. Swanson, "The bleak future of NAND flash memory," in USENIX Conference on File and Storage Technologies (FAST), 2012.
[46] A. Gupta, Y. Kim, and B. Urgaonkar, "DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings," Architectural Support for Programming Languages and Operating Syst., pp. 229–240, 2009.
[47] P. Huang, P. Subedi, X. He, S. He, and K. Zhou, "FlexECC: partially relaxing ECC of MLC SSD for better cache performance," USENIX Annual Technical Conf., pp. 489–500, 2014.
[48] P. Huang, G. Wan, K. Zhou, M. Huang, C. Li, and H. Wang, "Improve effective capacity and lifetime of solid state drives," in International Conference on Networking, Architecture and Storage (NAS), 2013, pp. 50–59.
[49] S. Im and D. Shin, "Differentiated space allocation for wear leveling on phase-change memory-based storage device," IEEE Trans. Consum. Electron., vol. 60, no. 1, pp. 45–51, 2014.
[50] E. Ipek, J. Condit, E. B. Nightingale, D. Burger, and T. Moscibroda, "Dynamically replicated memory: Building reliable systems from nanoscale resistive memories," in Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010, pp. 3–14.
[51] J. Jeong, S. S. Hahn, S. Lee, and J. Kim, "Lifetime improvement of NAND flash-based storage systems using dynamic program and erase scaling," in FAST, 2014, pp. 61–74.
[52] J.-Y. Jung and S. Cho, "Memorage: emerging persistent ram based malleable main memory and storage architecture," in Int. Conf. on Supercomputing, 2013, pp. 115–126.
[53] M. Jung, W. Choi, J. Shalf, and M. T. Kandemir, "Triple-A: a Non-SSD based autonomic all-flash array for high performance storage systems," Architectural Support for Programming Languages and Operating Systems, pp. 441–454, 2014.
[54] S. Jung, Y. Lee, and Y. Song, "A process-aware hot/cold identification scheme for flash memory storage systems," IEEE Trans. Consum. Electron., vol. 56, no. 2, pp. 339–347, 2010.
[55] S. Kannan, A. Gavrilovska, and K. Schwan, "Reducing the cost of persistence for nonvolatile heaps in end user devices," in Int. Symp. on High Performance Computer Architecture, 2014.
[56] S. Kannan, A. Gavrilovska, K. Schwan, and D. Milojicic, "Optimizing checkpoints using NVM as virtual memory," in Int. Symp. on Parallel & Distributed Processing, 2013, pp. 29–40.
[57] S. Kannan, A. Gavrilovska, K. Schwan, D. Milojicic, and V. Talwar, "Using active NVRAM for I/O staging," in Petascale Data Analytics: Challenges and Opportunities, 2011, pp. 15–22.
[58] T. Kgil, D. Roberts, and T. Mudge, "Improving NAND flash based disk caches," in International Symposium on Computer Architecture, 2008, pp. 327–338.
[59] H. Kim and S. Ahn, "BPLRU: A Buffer Management Scheme for Improving Random Writes in Flash Storage," USENIX Conf. on File and Storage Technologies, vol. 8, pp. 1–14, 2008.
[60] H. Kim, S. Seshadri, C. L. Dickey, and L. Chiu, "Evaluating phase change memory for enterprise storage systems: a study of caching and tiering approaches," in USENIX Conference on File and Storage Technologies (FAST), 2014, pp. 33–45.
[61] H. Kim, J. Ahn, S. Ryu, J. Choi, and H. Han, "In-memory file system for non-volatile memory," in Research in Adaptive and Convergent Systems. ACM, 2013, pp. 479–484.
[62] Y. Kim, A. Gupta, B. Urgaonkar, P. Berman, and A. Sivasubramaniam, "HybridStore: A cost-efficient, high-performance storage system combining SSDs and HDDs," in International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2011, pp. 227–236.
[63] R. Koller, L. Marmol, R. Rangaswami, S. Sundararaman, N. Talagala, and M. Zhao, "Write policies for host-side flash caches," USENIX Conf. File and Stor. Technol., pp. 45–58, 2013.
[64] I. Koltsidas and S. D. Viglas, "Flashing up the storage layer," VLDB Endowment, vol. 1, no. 1, pp. 514–525, 2008.
[65] M. Kryder and C. Kim, "After hard drives–what comes next?" IEEE Trans. Magn., vol. 45, no. 10, pp. 3406–3413, 2009.
[66] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An architecture for global-scale persistent storage," ACM Sigplan Notices, vol. 35, no. 11, pp. 190–201, 2000.
[67] E. Lee, S. Yoo, J.-E. Jang, and H. Bahn, "Shortcut-JFS: A write efficient journaling file system for phase change memory," Symp. on Mass Storage Systems and Technologies, pp. 1–6, 2012.
[68] J. Lee, Y. Kim, G. M. Shipman, S. Oral, F. Wang, and J. Kim, "A semi-preemptive garbage collector for solid state drives," in Int. Symp. on Performance Analysis of Systems and Software, 2011, pp. 12–21.
[69] S. Lee, J. Kim, and A. Mithal, "Refactored Design of I/O Architecture for Flash Storage," Comput. Archit. Lett., 2014.
[70] S.-W. Lee and B. Moon, "Design of flash-based DBMS: an in-page logging approach," in ACM SIGMOD International Conference on Management of Data, 2007, pp. 55–66.
[71] S. Lee, T. Kim, K. Kim, and J. Kim, "Lifetime management of flash-based SSDs using recovery-aware dynamic throttling," in FAST, 2012, p. 26.
[72] D. Li, J. S. Vetter, G. Marin, C. McCurdy, C. Cira, Z. Liu, and W. Yu, "Identifying opportunities for byte-addressable non-volatile memory in extreme-scale scientific applications," in Int. Parallel & Distributed Processing Symp., 2012, pp. 945–956.
[73] J. Liao, F. Zheng, L. Li, and G. Xiao, "Adaptive wear-leveling in flash-based memory," Computer Architecture Letters, 2014.
[74] K. Liu, X. Zhang, K. Davis, and S. Jiang, "Synergistic coupling of SSD and hard disk for QoS-aware virtual memory," in Int. Symp. Perform. Analysis of Syst. and Software, 2013, pp. 24–33.
[75] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, "On the role of burst buffers in leadership-class storage systems," in Symposium on Mass Storage Systems and Technologies (MSST), 2012, pp. 1–11.
[76] R.-S. Liu, D.-Y. Shen, C.-L. Yang, S.-C. Yu, and C.-Y. M. Wang, "NVM duet: unified working memory and persistent store architecture," in Architectural Support for Programming Languages and Operating Systems, 2014, pp. 455–470.
[77] R.-S. Liu, C.-L. Yang, C.-H. Li, and G.-Y. Chen, "DuraCache: A durable SSD cache using MLC NAND flash," in Design Automation Conference, 2013, p. 166.
[78] R.-S. Liu, C.-L. Yang, and W. Wu, "Optimizing NAND flash-based SSDs via retention relaxation," USENIX Conference on File and Storage Technologies (FAST), 2012.
[79] Z. Liu, B. Wang, P. Carpenter, D. Li, J. S. Vetter, and W. Yu, "PCM-Based durable write cache for fast disk I/O," in Int. Symp. on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012, pp. 451–458.
[80] R. Maddah, S. Cho, and R. Melhem, "Data dependent sparing to manage better-than-bad blocks," Computer Architecture Letters, vol. 12, no. 2, pp. 43–46, 2013.
[81] J. Matthews, S. Trika, D. Hensgen, R. Coulson, and K. Grimsrud, "Intel® turbo memory: Nonvolatile disk caches in the storage hierarchy of mainstream computer systems," ACM Transactions on Storage (TOS), vol. 4, no. 2, pp. 4:1–4:24, 2008.
[82] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, "A case for efficient hardware/software cooperative management of storage and memory," Worksh. on Energy Efficient Design, 2013.
[83] S. Mittal, "A survey of power management techniques for phase change memory," International Journal of Computer Aided Engineering and Technology (IJCAET), 2014.
[84] S. Mittal and J. Vetter, "A Survey of Methods for Analyzing and Improving GPU Energy Efficiency," ACM Computing Surveys, 2015.
[85] S. Mittal and J. Vetter, "A Survey of Techniques for Modeling and Improving Reliability of Computing Systems," IEEE Transactions on Parallel and Distributed Systems (TPDS), 2015.
[86] S. Mittal, J. S. Vetter, and D. Li, "A Survey of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches," IEEE Transactions on Parallel and Distributed Systems (TPDS), 2014.
[87] J. C. Mogul, E. Argollo, M. A. Shah, and P. Faraboschi, "Operating System Support for NVM+DRAM Hybrid Main Memory," in HotOS, 2009.
[88] I. Moraru, D. G. Andersen, M. Kaminsky, N. Tolia, P. Ranganathan, and N. Binkert, "Consistent, durable, and safe memory management for byte-addressable non-volatile main memory," TRIOS, 2013.
[89] D. Narayanan and O. Hodson, "Whole-system persistence," in ACM SIGARCH Computer Architecture News, vol. 40, no. 1, 2012, pp. 401–410.
[90] D. Narayanan, E. Thereska, A. Donnelly, S. Elnikety, and A. Rowstron, "Migrating server storage to SSDs: analysis of tradeoffs," in European Conf. Comput. Syst., 2009, pp. 145–158.
[91] Y. Oh, J. Choi, D. Lee, and S. H. Noh, "Caching less for better performance: balancing cache size and update cost of flash memory cache in hybrid storage systems," in USENIX Conference on File and Storage Technologies (FAST), vol. 12, 2012.
[92] Y. Pan, G. Dong, Q. Wu, and T. Zhang, "Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications," in Int. Symp. on High Performance Computer Architecture, 2012, pp. 1–10.
[93] A. Pande et al., "Video delivery challenges and opportunities in 4G networks," IEEE MultiMedia, vol. 20, no. 3, pp. 88–94, 2013.
[94] V. Papirla and C. Chakrabarti, "Energy-aware error control coding for flash memories," in Design Automation Conference, 2009, pp. 658–663.
[95] S. Park, Y. Kim, B. Urgaonkar, J. Lee, and E. Seo, "A comprehensive study of energy efficiency and performance of flash-based SSD," Journal of Systems Architecture, vol. 57, no. 4, pp. 354–365, 2011.
[96] S. Park and K. Shen, "A performance evaluation of scientific I/O workloads on flash-based SSDs," in IEEE International Conference on Cluster Computing and Workshops, 2009, pp. 1–5.
[97] P. Pavan, R. Bez, P. Olivo, and E. Zanoni, "Flash memory cells-an overview," Proceedings of the IEEE, vol. 85, no. 8, pp. 1248–1271, 1997.
[98] H. Payer, M. A. Sanvido, Z. Z. Bandic, and C. M. Kirsch, "Combo drive: Optimizing cost and performance in a heterogeneous storage device," in Workshop on Integrating Solid-state Memory into the Storage Hierarchy, vol. 1, 2009, pp. 1–8.
[99] S. Pelley, P. M. Chen, and T. F. Wenisch, "Memory persistency," in Int. Symp. on Comput. Archit., 2014, pp. 265–276.
[100] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, "DES-