
ECC-Map: A Resilient Wear-Leveled Memory-Device Architecture with Low Mapping Overhead


Natan Peled
Viterbi Department of ECE
Technion - Israel Institute of Technology
Haifa, Israel
[email protected]

Yuval Cassuto
Viterbi Department of ECE
Technion - Israel Institute of Technology
Haifa, Israel
[email protected]

ABSTRACT

New non-volatile memory technologies show great promise for extending the memory hierarchy, but have limited endurance that needs to be mitigated toward their reliable use closer to the processor. Wear leveling is a common technique for prolonging the life of endurance-limited memory, where existing wear-leveling approaches either employ costly full-indirection mapping between logical and physical addresses, or choose simple mappings that cannot cope with extremely unbalanced write workloads. In this work, we propose ECC-Map, a new wear-leveling device architecture that can level even the most unbalanced and adversarial workloads, while enjoying low mapping complexity compared to full indirection. Its key idea is using a family of efficiently computable mapping functions that allows selectively remapping heavily written addresses, while controlling the mapping costs by limiting the number of functions used at any given time. ECC-Map is evaluated on common synthetic workloads, and is shown to significantly outperform existing wear-leveling architectures. The advantage of ECC-Map grows with the device's size-to-endurance ratio, a parameter that is expected to grow in the scaling trend of growing capacities and shrinking reliabilities.

CCS CONCEPTS

• Hardware → Non-volatile memory; Hardware reliability; Analysis and design of emerging devices and systems; Memory and dense storage; • Information systems → Storage class memory.

KEYWORDS

Non-volatile memory, persistent memories, wear-leveling, error-correcting codes.

1 INTRODUCTION

Computing systems are challenged by the growing memory demands of data-intensive applications. These demands grow faster than the scaling of DRAM, the principal main-memory technology [13]. Thus, new memories, called persistent non-volatile (NV) memories, are being designed and deployed in emerging computing architectures. These memories have lower cost than DRAM memories, and faster access than non-volatile storage technologies such as NAND Flash. The design challenge of NV memories lies in their "sandwich" status: having both the performance expectations of main memory and the cost expectations of backing storage. NV memories already exist in a variety of technologies such as PCM, RERAM, MRAM [1, 9, 20], and others, but it is expected that scaled-up versions of these (or other) technologies will find an even more dominant role in the future. In those systems, it is expected that commercial NV memory devices will have large capacities, small data units (lines) for fast access, and limited endurance due to low cost/high density.

As NV memories take on the demanding workloads of data-intensive applications, concern is raised about their limited endurance. This concern is not new: early phase-change memory (PCM) based main-memory architectures already addressed their limited endurance by devising wear-leveling mechanisms. However, the existing solutions are not sufficient to mitigate the problem in the scaling trend of growing memory capacities and shrinking endurances (due to increased density). In fact, the wear-leveling problem becomes dramatically more challenging as the capacity grows or as the endurance decreases (and doubly so if both happen simultaneously). Consequently, we revisit the wear-leveling problem in this work, and propose a new device architecture we call ECC-Map to support flexible wear leveling with low implementation costs.

Even though Flash devices suffer from a similar endurance problem [21], existing solutions for Flash devices do not fit persistent NV memories. Writing in Flash devices is split between program and erase operations [11], where each works at a different granularity. The effect of this is that individual lines ("pages" in the Flash terminology) cannot be re-written in place, which necessitates out-of-place writing and frequent data movements (garbage collection) [17]. For that purpose, most Flash storage products implement a full-indirection translation layer in their controllers. Full indirection using mapping tables makes the wear-leveling problem easier to handle, but with the high cost of using large tables in expensive controller memory. In the case of persistent NV memories, for example in PCM, the device cells do not have to be erased before writing to them. In addition, the access granularity is typically finer than in Flash devices. Those properties obviate the need for garbage collection, allowing more economical wear-leveling mechanisms than full indirection tables.

Toward a clear presentation of the problem and our ECC-Map architecture, we use a simple memory-device model. The device is accessed by a host computing system through a read/write interface spanning a linear range of logical addresses, and it employs an internal controller for managing the read/write of the physical memory. The device-host read/write interface uses data units we call lines. From the host side, a line is addressed by its logical line address (LLA), and internally a line is stored in a physical line address (PLA). The device has 𝑁 PLAs in total, and every PLA can be written at most 𝑤𝑚𝑎𝑥 times in the device lifetime.

The basic need that the device architecture needs to fulfil is mapping flexibility between LLAs and PLAs, such that even if LLA write loads are extremely unbalanced, no PLA will exceed its endurance limit 𝑤𝑚𝑎𝑥 prematurely. In addition to mapping flexibility, the architecture needs to specify the wear-leveling algorithms governing the evolution of this mapping in the device lifetime, and the internal data movements.

A well-known wear-leveling architecture, called Start-Gap (SG) [14], uses a very economical mapping structure and allows basic address remappings called gap movements. The gap is a spare PLA, and its movement is done by writing into it the LLA residing in the adjacent PLA. While SG has demonstrated good wear-leveling performance in some natural workloads, in extremely unbalanced workloads its performance is not satisfactory. In particular, when 𝑤𝑚𝑎𝑥 is not orders of magnitude larger than 𝑁, SG fails to adequately level an adversarial workload that continuously writes to a single LLA, which we call the 1-LLA workload in this paper. This is a major potential impediment in practice given the earlier stated trend of growing 𝑁 and shrinking 𝑤𝑚𝑎𝑥. Some mitigations have been proposed for this undesired behavior, but none of them fully solves the problem. Dividing the device into regions is a common useful proposition made in [14, 15, 26] (and others); but with the blessing of breaking 𝑁 into small "mini-devices" comes the curse (and complexity) of simultaneously managing many such mini-devices. Other proposed approaches to deal with extremely unbalanced workloads have been to cache those writes in an unlimited-endurance medium, or better yet, to block the writes from the device by caching them "elsewhere". While these may work in specific system settings, a stand-alone persistent-memory device can assume neither of the two, and must guarantee adequate wear leveling on its own.

The new device architecture we present in this paper aims to solve the wear-leveling problem by enhancing the flexibility of the LLA-PLA mapping, while keeping the costs associated with this new mapping small and controlled. The fundamental problem of wear-leveling mapping architectures is that they must be able to map frequently-written LLAs flexibly across the PLA space, and tracking a flexible workload-dependent mapping costs memory and processing resources. If any LLA can be heavily written, resulting in its repeated remapping within the entire PLA space, it appears intuitively necessary for the device to keep for each LLA its current full PLA address. Storing and maintaining such a mapping entails high cost (memory space) and complexity (persisting and/or wear-leveling meta-data), which we would like to avoid for commercial viability. Fortunately, contradicting this intuition, we show in the sequel a mapping architecture that is able to relocate any LLA across the entire PLA space, without maintaining a costly LLA-to-PLA mapping.

The main idea of ECC-Map is to use a family of mapping functions to maintain the LLA-PLA mapping. A large family of functions gives more flexibility than the simple functions used in prior work, and at the same time more efficiency than offered by a mapping table in memory. Around these mapping functions we design the entire mapping architecture and its algorithms, which we summarize in the following by stating ECC-Map's main ingredients.

(1) A family of efficiently computable mapping functions with properties allowing effective reclaiming of unused wear. Each member of the family is defined by an integer mapping index.
(2) A sliding window bounding the range of mapping indices used throughout the device at a given time. The window size controls the mapping complexity.
(3) Selective remapping of specific logical addresses from their current physical locations to a new location determined by a subsequent mapping index.
(4) A remapping trigger invoked when a physical location reaches a designated wear threshold, based on either a write-count estimate or a reliability estimate.
(5) Mapping-index randomization to prevent an adversary from tracking mapping pairs and generating harmful adaptive workloads.

Ingredients 1-4 are new, to the best of our knowledge, while ingredient 5 is a commonly used security measure.

We provide the details of the ECC-Map architecture in Section 2, and in Section 3 we present the performance evaluation of ECC-Map on a device discrete-event simulation. The results show that high device utilizations can be reached even for 𝑁/𝑤𝑚𝑎𝑥 ratios that fail prior wear-leveling architectures. The promising results of the modeled device in the simulation environment motivate the future implementation of ECC-Map in a hardware device within a working computing system. For that, in Section 4 we discuss finer implementation details toward a more practical realization of the architecture in hardware.

2 THE MAPPING ARCHITECTURE

It is imperative upon a memory device to maximize its ability to serve host writes before reaching its physical endurance limits. In this work, we assume an endurance model whereby each PLA is limited to 𝑤𝑚𝑎𝑥 total physical writes, and once exceeded, it cannot be used anymore for read or write. To guarantee full usability, we define the device's lifetime as the time until any PLA exceeds 𝑤𝑚𝑎𝑥 writes. Hence, the key performance objective in this work is to maximize the total number of host writes served by the device in its lifetime. Let 𝑁 be the number of PLAs in the device, that is, its physical capacity in units of lines. Then 𝑤𝑚𝑎𝑥 · 𝑁 is a fundamental upper bound on the number of host writes served within the device lifetime. With respect to this fundamental bound, we define the writing utilization as

𝑈𝑡𝑖𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛 = #𝐻𝑜𝑠𝑡 𝑤𝑟𝑖𝑡𝑒𝑠 / (𝑤𝑚𝑎𝑥 · 𝑁),   (1)

which corresponds to a specific write workload served by the device. The utilization is a number between 0 and 1, the higher the better. Note that the utilization measure accounts for the full storage cost 𝑁, even in the case of over-provisioned physical storage 𝑁 > 𝐾, where 𝐾 is the number of LLAs. Furthermore, the numerator counts host writes and not physical writes (the latter include internal writes by the mapping layer), and thus (1) captures the true utility offered by the device to the customer. In contrast, some prior works (e.g., [14]) use performance measures that count the total number of physical writes (including internal writes), and are thus not valid in cases of significant write amplification. Later in the paper we evaluate the utilization for several important workloads, focusing primarily on notoriously challenging workloads.

Improving the utilization by wear leveling is made possible by implementing a mapping layer that spreads the uneven LLA access more evenly across the PLA space.
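As a quick numeric illustration of the utilization measure (1) — the figures below are hypothetical, not results from the paper:

```python
def utilization(host_writes: int, w_max: int, n_plas: int) -> float:
    """Writing utilization of Eq. (1): host writes served, normalized by the
    fundamental bound of w_max physical writes on each of the N PLAs."""
    return host_writes / (w_max * n_plas)

# Hypothetical device: N = 1024 lines, endurance w_max = 2048 writes per
# line (size-to-endurance ratio 0.5), serving 1.5M host writes before the
# first PLA wears out.
u = utilization(1_500_000, w_max=2048, n_plas=1024)   # ≈ 0.715
```

A perfect wear leveler with zero internal writes would approach utilization 1; internal remapping writes consume part of the 𝑤𝑚𝑎𝑥 · 𝑁 budget without counting in the numerator.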

The mapping layer maintains a dynamic mapping function between the 𝐾 LLAs and the 𝑁 PLAs; in general 𝑁 ≥ 𝐾, and we define 𝜌 = (𝑁 − 𝐾)/𝑁 ≥ 0 as the spare factor of the device. At any point in time, the mapping function needs to be injective, that is, not mapping multiple LLAs to the same PLA. The function need not be surjective, that is, not all PLAs need to map to LLAs. The simplest but most costly implementation of the mapping function is by a mapping table having an entry for each LLA storing its mapped PLA. A mapping table can wear level effectively, but its hardware and maintenance costs are prohibitive. Much more efficient mapping layers use a global efficiently computable function: 𝑃𝐿𝐴 = 𝑓(𝐿𝐿𝐴), where 𝑓(·) changes in time, and storing its specification can be done with little memory. The key of the proposed mapping architecture of this work is to extend this from a single function 𝑓(·) to a family of functions {𝑓𝑖(·)}, allowing different LLAs to be mapped using different mapping indices 𝑖. The combined mapping function, which maps every LLA to the PLA output by the function 𝑓𝑖(·) designated for it, needs to be an injective mapping at every given time. Adding the index to the mapping function improves its flexibility to level the wear, while bounding the mapping cost is possible by limiting the range of indices used throughout the LLA space at any given time.

2.1 Implementation of the Mapping Functions

To implement the family of functions, we use encoding functions of cyclic error-correcting codes (ECC), used elsewhere for error correction and detection, including as cyclic redundancy check (CRC) [12] codes. We choose these functions for the several advantages they offer: 1) efficient hardware implementation, 2) simple reverse mapping, and 3) spreading an LLA mapping across the entire PLA space (as we detail later). The third feature is critical for obtaining high utilization in adversarial workloads, and is in general not satisfied by alternative options such as cryptographic pseudo-random permutations. Common cryptographic functions also require a significantly higher computation load relative to cyclic ECC encoding.

Assuming 𝑁 (the number of PLAs) is an integer power of 2, we define an integer parameter 𝑚 = log2 𝑁. For the function family we take a binary cyclic ECC with parameters [𝑛, 𝑘], where 𝑛 is the codeword length and 𝑘 is the number of information bits input to the encoder. 𝑟 = 𝑛 − 𝑘 is the redundancy of the code, and the code is specified by a binary generator polynomial of degree 𝑟. A convenient source for such codes is the family of primitive BCH codes that exist for a rich variety of [𝑛, 𝑘] combinations; some sample generator polynomials of BCH codes can be found in [7]. We choose a code with 𝑟 = 𝑚 and 𝑘 ≥ 2𝑚. The input to the encoder is the binary vector [𝐿𝐿𝐴|𝑖], where | represents concatenation and 𝑖 is the mapping index. 𝐿𝐿𝐴 is represented as an 𝑚-bit vector and 𝑖 as a (𝑘 − 𝑚)-bit vector, both using the standard binary representation. The encoding is depicted in Figure 1a. The output of the encoder is an 𝑟 = 𝑚-bit representation of the output 𝑃𝐿𝐴. Using the encoder as specified, we get the following.

[Figure 1 diagram: (a) forward mapping — the encoder input is [LLA | index] (𝑚 + (𝑘 − 𝑚) bits) and its 𝑚-bit output is the PLA; (b) inverse mapping — the input is [index | PLA] and the output is the LLA.]

Figure 1: (a) Forward mapping. The input LLA and the function index (in white) comprise the input to the ECC encoder. The encoder's output (shaded orange) gives the PLA resulting from the mapping. (b) Inverse mapping. PLA and LLA exchange roles: PLA is now part of the input (in white) and LLA is the output (shaded blue). Since this layout is a cyclic shift from the forward mapping, the exact same encoder function can be used.

Property 1. 𝑓𝑖(·) is an injective function for every index 𝑖.

Property 1 is proven by contradiction: assume there are two different LLAs mapping to the same PLA for the same index. Then by subtracting the corresponding two codewords (modulo 2), we get (from linearity) a third codeword all of whose non-zeros are confined to 𝑚 or fewer consecutive coordinates, which contradicts a known property of cyclic codes with redundancy 𝑟 = 𝑚.

The inverse mapping (from PLA and index to LLA) is shown in Figure 1b: the input is [𝑖|𝑃𝐿𝐴] and the output is 𝐿𝐿𝐴. From the cyclic property of the code and the fact that Figure 1b is obtained from Figure 1a by a cyclic shift of 𝑚 positions to the left, the inverse mapping can use the same encoding function used by the forward mapping, but note the required reordering of the input arguments [𝑖|𝑃𝐿𝐴] vs. [𝐿𝐿𝐴|𝑖]. This family of functions also enjoys the following very useful property (proved in the Appendix).

Property 2. For any 0 ≤ 𝑖 < 𝑗 < 𝑁, 𝑓𝑖(𝐿𝐿𝐴) ≠ 𝑓𝑗(𝐿𝐿𝐴) for every 𝐿𝐿𝐴.

The importance of Property 2 is that a single LLA does not return to the same PLA before reaching index 𝑖 = 𝑁. This allows the wear-leveling scheme using this mapping to utilize the endurance of all 𝑁 PLAs even if a single LLA is written by the host.

2.2 Sliding Window of Mapping Indices

The large number (≥ 𝑁) of mapping indices supported by the proposed mapping functions is clearly useful for effective spreading of the write load throughout the entire device. However, toward limiting the resources consumed by the mapping, we restrict all LLAs to have mapping indices in a subset of 𝑆 consecutive indices. This subset changes as a sliding window throughout the device lifetime. That is, the mapping-index set is {base, base + 1, . . . , base + 𝑆 − 1}, for some integer base, and every LLA has a mapping index

𝑖 = base + offset𝑖,   (2)

where offset𝑖 ∈ {0, . . . , 𝑆 − 1}. This allows the mapping architecture to keep the value of base in a global register, and represent the index 𝑖 compactly as offset𝑖, which takes only log2 𝑆 bits. Figure 2a depicts this restriction of indices to a sliding window of size 𝑆 = 4. 𝑆 is a design parameter of the architecture: large 𝑆 allows more flexibility for selective remapping, but also increases the complexity and/or costs of maintaining the mapping. We defer to Section 4 the detailed discussion of the effect of 𝑆 on mapping complexity.
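As a software illustration of the construction in Section 2.1, the following is a minimal toy instance, not the paper's implementation: it uses the cyclic [15, 11] Hamming code with generator polynomial x⁴ + x + 1 (so 𝑚 = 𝑟 = 4, 𝑘 = 11 ≥ 2𝑚), giving a toy device of 𝑁 = 16 PLAs; a real device would use a larger primitive BCH code.

```python
M = 4            # m = log2(N); toy device with N = 16 PLAs
K_BITS = 11      # k information bits of the [n = 15, k = 11] cyclic code
G = 0b10011      # generator polynomial x^4 + x + 1, degree r = m = 4

def f(lla: int, i: int) -> int:
    """f_i(LLA): systematic cyclic-code encoding of the message [LLA | i].
    The m-bit remainder (the codeword's parity part) is the output PLA."""
    msg = (lla << (K_BITS - M)) | i        # concatenate: LLA in the top m bits
    reg = msg << M                         # multiply by x^r before reduction
    for bit in range(K_BITS + M - 1, M - 1, -1):
        if (reg >> bit) & 1:               # polynomial long division modulo 2
            reg ^= G << (bit - M)
    return reg                             # r = m bits: the PLA

# Property 1: for a fixed index i, no two LLAs share a PLA.
assert all(len({f(lla, i) for lla in range(16)}) == 16 for i in range(16))
# Property 2: a fixed LLA visits N distinct PLAs over indices 0 .. N-1.
assert all(len({f(lla, i) for i in range(16)}) == 16 for lla in range(16))
```

In the paper's construction the inverse mapping reuses the same encoder with the inputs reordered to [𝑖|𝑃𝐿𝐴]; the toy above implements only the forward direction.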

[Figure 2 diagram: two copies of the function/index axis 𝑓0, 𝑓1, 𝑓2, 𝑓3, 𝑓4, . . . , 𝑓𝑀−1 over indices 0, 1, 2, 3, 4, . . . , 𝑀 − 1, with (a) the initial window and (b) the window after a movement highlighted.]

Figure 2: Sliding window of active mapping indices, for the case 𝑆 = 4. (a) Initial set in shaded orange when base = 0. (b) After incrementing to base = 1, the set moves to the window in shaded blue.

In the meantime, we note a practical disadvantage of using offset𝑖 (as in (2)) for representing the index, due to the need to update offset𝑖 when the window slides to a subsequent base, even if 𝑖 is unchanged. Instead, we propose an alternative compact representation: ī = 𝑖 mod 𝑆, which can be used to recover 𝑖 using the formula

𝑖 = base + ((ī − base) mod 𝑆).   (3)

It is clear that ī remains unchanged if 𝑖 is unchanged, even if base is increased. For example, an LLA mapped with 𝑖 = 3 in Figure 2a has ī = 3, which remains the same even after the window movement to base = 1 in Figure 2b. In this latter state, 𝑖 can be recovered by (3): 𝑖 = 1 + ((3 − 1) mod 4) = 3.

2.3 Selective Remapping

The unique feature of the proposed architecture is that different LLAs can be mapped by different mapping functions (mapping indices). This feature allows selective remapping of heavily written LLAs by incrementing their mapping index, while keeping other LLAs at their current mapping indices and physical locations. Thanks to the device over-provisioning (𝑁 > 𝐾), it is possible to remap an LLA with minimal change to the mapping of other LLAs. A heavier remapping operation, called catch-up, occurs when the remapped LLA's mapping index is incremented beyond the current index window. In this case, the index window needs to shift, and with it will move all the LLAs that are currently mapped by indices below its new base index. However, since 𝑆 ≫ 1, catch-up events are a minuscule minority of the remapping events. We give some more details on the remapping operations, starting with how remapping events are triggered.

2.3.1 Remapping trigger. Selective remapping warrants the definition of a trigger event for moving a specific LLA from its current PLA. A global write counter, used in most prior wear-leveling architectures (e.g., [14]), would not suffice in this case. Informally, we remap an LLA written by the host if its current PLA has reached a wear level that holds the risk of its premature failing. Toward this end, we specify a wear threshold 𝜙 < 𝑤𝑚𝑎𝑥, with the following policy: a host write to an LLA mapped to a PLA that has exceeded 𝜙 writes will be written after remapping the LLA to a different PLA. Note that the policy is only applied to host writes; a write that is part of a remapping operation will not trigger an additional remapping, even if the written PLA exceeded the threshold. This differentiation makes the remapping procedures (discussed next) simpler and more deterministic. The value of 𝜙 is an optimization variable, set to vacate a worn PLA "just in time" to keep it usable for all future remappings. In Section 3 we specify a formula for the value of 𝜙, derived based on an analysis of ECC-Map that is included in the appendix. Although 𝜙 is given as a count of physical writes, implementing the remapping trigger does not require maintaining PLA write counters (which would be expensive). Instead, reaching a threshold of 𝜙 can be detected by a reliability measurement of the PLA, for example by counting the number of bit errors corrected by the decoder of the data error-correcting codes.

2.3.2 Regular remapping. The vast majority of remapping operations follow the simple procedure we describe next. When 𝐿𝐿𝐴 triggers remapping, it is moved from mapping index 𝑖 to 𝑖 + 1. If 𝑓𝑖+1(𝐿𝐿𝐴) is an unused PLA, the remapping is complete – we call this a non-colliding regular remapping. In a colliding regular remapping, before writing 𝐿𝐿𝐴 to 𝑓𝑖+1(𝐿𝐿𝐴), the 𝐿𝐿𝐴′ currently mapped to this PLA with index 𝑗 is remapped to index 𝑗 + 𝛿, where 𝛿 is the smallest positive integer such that 𝑓𝑗+𝛿(𝐿𝐿𝐴′) is an unused PLA. The procedure guarantees the movement of 𝐿𝐿𝐴 to index 𝑖 + 1, and in case of collision, moves 𝐿𝐿𝐴′ out to a free PLA. The rationale behind giving 𝐿𝐿𝐴 priority over 𝐿𝐿𝐴′ is to minimize the index increase of host-written LLAs, thus allowing more writes before an LLA exits the index window.

2.3.3 Catch-up remapping. When an LLA needs to move during regular remapping to an index equal to or greater than base + 𝑆, a catch-up procedure is invoked. In the catch-up procedure we first set a new base value greater than the current one, and then remap every LLA with an index smaller than the new base to an index greater than or equal to it. The amount by which base is shifted, as well as the new indices chosen for the catching-up LLAs, are a matter for optimization. In this work we simply set base ← base + 𝑆, and remap all LLAs with smaller indices to the new base index. Other catch-up algorithms (not used in this work) may alternatively add less than 𝑆 to base, and/or move LLAs to indices strictly beyond the new base.

Figure 3a and Figure 3b illustrate the operations of regular remapping described above. The plotted arrays represent the PLA space of the device, and the capital letters in the array are the LLAs mapped to the corresponding PLAs. Initially, LLA 𝐴 is mapped by index 𝑖 as shown at the top part of Figure 3a. Upon its triggered remapping, 𝐴 moves to its PLA position at the bottom part by incrementing its index to 𝑖 + 1. In this remapping there is no collision with another LLA. Figure 3b shows the next remapping of 𝐴, in which there is a collision with LLA 𝐹; to free the PLA for 𝐴, 𝐹 is moved to the PLA mapped to it by index 𝑗 + 3, because the lower indices 𝑗 + 1 and 𝑗 + 2 map to used PLAs.

Figure 4 illustrates the catch-up procedure described above. For 𝑆 = 4, the figure displays the mapping index of each LLA. Initially, base = 0 and 𝐴 has index 3 (top part). Upon remapping of 𝐴 (bottom part), its index is incremented to 4, which falls outside the current window {base, . . . , base + 𝑆 − 1} = {0, 1, 2, 3}, thus invoking catch-up. This example implements the simple algorithm we use in this work: setting base ← base + 𝑆 = 4, and updating all LLAs to the new base.
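The regular-remapping procedure of Section 2.3.2 can be sketched as follows. This is a simplified model, not the authors' simulator: `owner` tracks which LLA occupies each used PLA, `f` can be any mapping family that is injective per index, and the catch-up case of Section 2.3.3 is omitted.

```python
def regular_remap(lla, index_of, owner, f):
    """Move `lla` from its index i to i + 1 (Section 2.3.2). On collision,
    the resident LLA' is pushed forward by the smallest delta whose PLA is
    free; the host-written LLA always advances by exactly one index."""
    old_pla = f(lla, index_of[lla])
    owner.pop(old_pla, None)                 # vacate the current PLA
    index_of[lla] += 1                       # i -> i + 1
    target = f(lla, index_of[lla])
    victim = owner.get(target)
    if victim is not None:                   # colliding regular remapping
        delta = 1                            # smallest delta with a free PLA
        while f(victim, index_of[victim] + delta) in owner:
            delta += 1
        index_of[victim] += delta
        owner[f(victim, index_of[victim])] = victim
    owner[target] = lla

# Toy mapping family (illustrative only; any per-index-injective f works):
toy_f = lambda lla, i: (lla + 7 * i) % 16

# Demo: LLAs 0 and 7 both start at index 0; remapping LLA 0 to index 1
# collides with LLA 7 (toy_f(0, 1) == toy_f(7, 0) == 7), which is then
# pushed to index 1 (PLA 14).
index_of = {0: 0, 7: 0}
owner = {toy_f(0, 0): 0, toy_f(7, 0): 7}
regular_remap(0, index_of, owner, toy_f)
```

The while-loop search for the victim's next free PLA relies on Property 2: an LLA never revisits a PLA before cycling through all 𝑁 indices, so the search always terminates while free PLAs exist.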
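The compact index representation ī = 𝑖 mod 𝑆 of Section 2.2 and the recovery formula (3) translate directly to code, reproducing the paper's 𝑆 = 4 worked example:

```python
def compact(i: int, S: int) -> int:
    """Per-LLA stored form: only log2(S) bits, unchanged by window slides."""
    return i % S

def recover(i_bar: int, base: int, S: int) -> int:
    """Eq. (3): reconstruct the full mapping index from i_bar and base."""
    return base + ((i_bar - base) % S)

# The paper's example (S = 4): an LLA mapped with i = 3 keeps i_bar = 3
# even after the window slides from base = 0 to base = 1.
assert compact(3, 4) == 3
assert recover(3, base=0, S=4) == 3
assert recover(3, base=1, S=4) == 3   # i = 1 + ((3 - 1) mod 4) = 3
```

Recovery is exact for every index inside the active window {base, . . . , base + 𝑆 − 1}, which is all the architecture needs.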

[Figure 3 diagram: (a) non-colliding regular remapping — the PLA array before ("A B C D E F G H I J K L", 𝐴 at index 𝑖) and after ("B C D A E F G H I J K L", 𝐴 at index 𝑖 + 1); (b) colliding regular remapping — the array before ("B C D A E F G H I J K L", 𝐴 at 𝑖 + 1, 𝐹 at 𝑗) and after ("B C D E A G H I J K F L", 𝐴 at 𝑖 + 2, 𝐹 at 𝑗 + 3).]

Figure 3: Illustration of regular remapping. (a) The non-colliding case. In this case, 𝐴 moves from index 𝑖 to 𝑖 + 1 and reaches a free PLA. (b) The colliding case. In this case, the next index 𝑖 + 2 maps 𝐴 to a used PLA, thus requiring the movement of 𝐹 to a subsequent index that maps it to a free PLA.

[Figure 4 diagram: the PLA array before the catch-up ("A B C D E F G H I J K L", with per-LLA indices 3 0 0 2 1 1 3 0 2 3 2 1) and after it ("F A C B I G E H L D K J", with all indices equal to 4).]

Figure 4: Illustration of the catch-up procedure. Initially, base = 0 and all LLAs have indices in the range {0, 1, 2, 3}. Then 𝐴 is remapped to index 4, invoking a catch-up procedure leading to base = 4 and all LLAs mapped with index 4.

The final ingredient of the proposed mapping architecture is index randomization, applied for the purpose of hiding the instantaneous mapping functions from an adversary generating the write workload.

2.4 Mapping-index Randomization

As we make the standard assumption that the mapping functions used by the architecture are publicly known, an adversary may be able to track the mapping of LLAs and issue writes to those mapped to high-wear PLAs. To prevent this, we add a pseudo-random transformation between the running mapping indices 1, 2, 3, . . . , 𝑁 − 1 and the actual mapping numbers fed to the mapping functions (forward and inverse) of Figure 1a and Figure 1b.

For the transformation we use a standard linear-feedback shift register (LFSR) [2, 18], which is initialized to a random seed generated internally by the device. The random seed is the output mapping number corresponding to the initial mapping index 1 (we skip index 0 in the randomized setting). Then, each update of the register using the linear feedback gives the output mapping number of the subsequent mapping index. To maintain Property 2 for the output mapping numbers, we use an 𝑚-bit LFSR with period 2^𝑚 − 1 = 𝑁 − 1. This guarantees that no two indices in 1, . . . , 𝑁 − 1 have the same output mapping number, and fixes the 𝑘 − 2𝑚 most significant bits of the output mapping number to the same value for all indices 1, . . . , 𝑁 − 1, thus implying 𝑓LFSR(𝑖)(𝐿𝐿𝐴) ≠ 𝑓LFSR(𝑗)(𝐿𝐿𝐴) similarly to Property 2.

3 EVALUATION AND RESULTS

Before evaluating the proposed mapping architecture, we specify the formula we use to set the threshold parameter 𝜙 (see Section 2.3.1), as a function of the architecture parameters 𝑁, 𝑤𝑚𝑎𝑥, 𝑆. The formula is based on a theoretical analysis of ECC-Map for the 1-LLA workload that repeatedly writes to a single LLA until reaching the device end of life. The detailed analysis can be found in the appendix. We denote 𝛼 ≜ 𝜙/𝑤𝑚𝑎𝑥 as the fractional threshold, and set 𝛼 to be the following:

𝛼opt = { 1 − 𝑁/(𝑆 · 𝑤𝑚𝑎𝑥),  if 𝑁/𝑤𝑚𝑎𝑥 < 𝑆/3;   2/3,  otherwise }.   (4)

The subscript "opt" is used to mark the fact that this 𝛼 maximizes the utilization on the 1-LLA workload, according to the model and its analysis in the appendix.

3.1 Evaluation

Implementation. To evaluate the performance of the proposed architecture, we implemented all of its ingredients in a Python-based discrete-event simulator. The simulator accepts an arbitrary write workload, and runs it through the proposed device mapping layer, including exact management of the indexed mapping functions, and performing all remappings (regular and catch-up). Upon reaching the device end of life, that is, when a PLA first exceeds 𝑤𝑚𝑎𝑥 writes, the simulator stops and records the utilization value for this workload. Using a software simulator allows us to examine the device performance in a large variety and broad range of system variables. The principal system variable in our evaluation, which turns out to be the key performance determinant, is the size-to-endurance ratio 𝑁/𝑤𝑚𝑎𝑥. In general, as 𝑁/𝑤𝑚𝑎𝑥 grows, the more "difficult" it becomes for a given mapping architecture to level the wear. This fact is observed in [14], with the inequality 1/𝜓 > 𝑁/𝑤𝑚𝑎𝑥, where the left-hand side is the frequency of internal-copy writes required by the SG architecture. The same ratio 𝑁/𝑤𝑚𝑎𝑥 also appears (twice) in (4). The principal dependence on the ratio 𝑁/𝑤𝑚𝑎𝑥 allows us to use relatively small values in most tests (𝑁 = 1024, varying 𝑤𝑚𝑎𝑥), significantly speeding up the evaluation. To show that the absolute values of 𝑁, 𝑤𝑚𝑎𝑥 are secondary to their ratio, our evaluations include tests we repeat with 4 and 16 times larger 𝑁, 𝑤𝑚𝑎𝑥, showing no significant difference. We thus expect that much larger commercial devices with these size-to-endurance ratios – for example, an 𝑁/𝑤𝑚𝑎𝑥 = 2 device with 2G lines and endurance 1e9 – will perform similarly.

In addition to 𝑁/𝑤𝑚𝑎𝑥, other system variables we examine in our evaluation are the window size 𝑆, the spare factor 𝜌, and the trigger threshold 𝜙.
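The index randomization of Section 2.4 can be sketched with a toy 𝑚-bit Galois LFSR. The parameters below (𝑚 = 4, polynomial x⁴ + x + 1) are illustrative choices, not from the paper; any primitive polynomial of degree 𝑚 gives the required full period 2^𝑚 − 1 = 𝑁 − 1.

```python
M = 4
POLY = 0b10011        # x^4 + x + 1, primitive over GF(2)

def lfsr_next(state: int) -> int:
    """One Galois-LFSR step: multiply the state by x modulo POLY."""
    state <<= 1
    if state & (1 << M):
        state ^= POLY
    return state

def mapping_numbers(seed: int):
    """Output mapping numbers for running indices 1, 2, ..., N - 1.
    The seed itself is the number for index 1 (index 0 is skipped)."""
    out, state = [], seed
    for _ in range((1 << M) - 1):        # period 2^m - 1 = N - 1
        out.append(state)
        state = lfsr_next(state)
    return out

nums = mapping_numbers(seed=0b1011)      # seed would come from an internal RNG
assert len(set(nums)) == 15              # no two indices share a mapping number
```

Because the primitive polynomial makes the state cycle through all 2^𝑚 − 1 nonzero values, the transformed indices remain pairwise distinct, preserving the guarantee of Property 2.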
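The threshold rule and the resulting trigger threshold 𝜙 = 𝛼opt · 𝑤𝑚𝑎𝑥 can be transcribed as a small helper. This is a sketch assuming the two-case form of (4) (𝛼opt = 1 − 𝑁/(𝑆 · 𝑤𝑚𝑎𝑥) when 𝑁/𝑤𝑚𝑎𝑥 < 𝑆/3, and 2/3 otherwise); the function names are ours, and the parameter values are Table 1's defaults.

```python
def alpha_opt(n_plas: int, w_max: int, s_window: int) -> float:
    """Fractional wear threshold of Eq. (4), which maximizes the utilization
    on the 1-LLA workload (as assumed from the two-case form of (4))."""
    if n_plas / w_max < s_window / 3:
        return 1.0 - n_plas / (s_window * w_max)
    return 2.0 / 3.0

def phi(n_plas: int, w_max: int, s_window: int) -> int:
    """Wear threshold in physical writes (Section 2.3.1)."""
    return int(alpha_opt(n_plas, w_max, s_window) * w_max)

# Table 1 defaults: N/w_max = 0.5 (e.g., N = 1024, w_max = 2048), S = 32.
a = alpha_opt(1024, 2048, 32)   # 1 - 1/64 = 0.984375
```

Note the two occurrences of the ratio 𝑁/𝑤𝑚𝑎𝑥 in the rule, consistent with the remark in Section 3.1 that this ratio appears twice in (4); the rule is also continuous at the case boundary, where both branches equal 2/3.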

trigger threshold 𝜙. For convenience, we list in Table 1 the default values we use for these system variables, unless noted otherwise (each result typically varies one variable, leaving the rest at their default values).

system variable                          default value
Size-to-endurance ratio 𝑁/𝑤𝑚𝑎𝑥          0.5
Window size 𝑆                            32
Spare factor 𝜌                           20%
Trigger threshold 𝜙                      set by (4) to 𝛼opt𝑤𝑚𝑎𝑥
Table 1: Default values of system variables.

Comparison. In addition to studying the performance of the proposed architecture, this section compares this performance to the three state-of-the-art wear-leveling architectures for PCM: the SG and RBSG architectures [14], where the latter adds region partition to the former, and the region-based secure-PCM main-memory architecture [15] (Sec-Mem). The last of the three works by dynamically remapping a full region every certain number of host writes to it since the last remapping. Note that dynamic region mapping requires a mapping table with size linear in the number of regions. While there are follow-up works enhancing these architectures in different ways, these three works are the best known for the standard device model we consider here. Therefore, we expect to see a similar advantage over other variants, which also use global or region write counters as a trigger for remapping. For the SG architecture we use a device with the same logical capacity 𝐾, and a single spare PLA, as specified in [14]; for RBSG we use the same 𝐾 and 𝑁 parameters as ECC-Map. The secure PCM architecture does not need spare, thus it is used with 𝑁 = 𝐾 (and the same 𝐾). Note that the comparison is fair even if the values of 𝑁 are not equal, because the utilization metric penalizes the increased 𝑁 appropriately.

Measurements. We ran different write workloads, and for each architecture we counted the total number of writes (host and physical) the device served until its end of life. We recorded several performance metrics: the (logical) utilization (1), the number of host writes, and the number of physical writes. We repeated each test five times and averaged the results to smooth out the workload randomness.

Workloads. We tested four write workloads in our evaluations: 1) the 1-LLA workload, 2) the stress workload, 3) the uniform workload, and 4) the Zipfian distribution Zipf. In 1), we randomly choose a single LLA, and write to it repeatedly until reaching end of life. In 2), we randomly pick a 3% fraction of the LLAs and write only to them, where the selection within the set is uniform. In 3), each write draws an LLA uniformly from the entire space. In 4), each write draws an LLA from the whole address space with a non-uniform selection that follows the distribution 𝑝(𝑖, 𝐾) = (1/𝑖) / Σ_{𝑛=1}^{𝐾} (1/𝑛), where 𝑖 is the LLA's sequence number and 𝐾 is the number of LLAs. The 1-LLA workload is the key motivation of this work, hence it will be the focus of the evaluation. The stress and Zipf workloads model other challenging write patterns that the device needs to handle, and the "easier" uniform workload is included mainly as reference, since it is handled well by prior wear-leveling architectures.

3.2 Results
We first use the default values from Table 1 and plot in Figure 5a the utilizations of the four architectures for each of the four workloads. It is first observed that ECC-Map significantly outperforms the three prior architectures on the 1-LLA workload. RBSG's performance is satisfactory only on the uniform workload, and SG's is lower than ECC-Map's on all workloads except Zipf, on which it is very close to ECC-Map. Sec-Mem's performance on the stress workload is about 1/3 worse than ECC-Map's. On the uniform workload all architectures have good performance, as expected, with Sec-Mem slightly ahead and ECC-Map coming second. We continue the experiment with progressively larger size-to-endurance ratios 𝑁/𝑤𝑚𝑎𝑥. Recall from the discussion in Section 3.1 that larger ratios are in general more difficult to wear-level. Moving from the default value of 0.5 in Figure 5a, we plot in Figures 5b-5e the utilizations for four larger values of 𝑁/𝑤𝑚𝑎𝑥, each time multiplying it by 2. We get these ratios by fixing 𝑁 and halving 𝑤𝑚𝑎𝑥 successively. The value of 𝜙 is calculated using (4) for each tested ratio. We indeed see that increasing the size-to-endurance ratio decreases the utilizations on the 1-LLA and stress workloads. However, this decrease is much more graceful in ECC-Map than in Sec-Mem, while both SG and RBSG have near-zero utilizations on these workloads starting from 𝑁/𝑤𝑚𝑎𝑥 = 2. The performance of ECC-Map on the Zipf workload even improves with larger size-to-endurance ratios, for reasons that will be explained later in Section 3.2.3. The very low utilization values of both SG/RBSG and Sec-Mem throughout Figure 5 mean that these architectures cannot be used by devices with such size-to-endurance ratios.

3.2.1 Dependence on the absolute device size. In the next experiment, we examine the (in)sensitivity of the results to the absolute values of 𝑁, 𝑤𝑚𝑎𝑥, thus corroborating our claim that performance is determined by their ratio 𝑁/𝑤𝑚𝑎𝑥. Toward that, we ran the same workloads with the same ratios and three different device sizes 𝑁 = 1024, 4096, 16384 (with corresponding 𝑤𝑚𝑎𝑥 values). The results are recorded in Table 2, showing almost identical values of (logical) utilization between the different sizes.

3.2.2 Dependence on the window size 𝑆. In Figure 6, we plot the utilization's dependence on the window-size parameter 𝑆. Recall that 𝑆 controls the mapping richness/complexity, so it is important to examine its effect on performance. We ran the workloads using the default device parameters, each time implementing a different window size 𝑆 = 16, 32, 64, 128. First, the results show that for all values of 𝑆 and all workloads, the architecture achieves significant utilization (for comparison, we plot the RBSG 1-LLA utilization as a horizontal line). It can be observed that a significant improvement is offered to the 1-LLA workload when increasing 𝑆 from 16 to 32, while subsequent increases give more modest advantages. That means that for these device parameters, 𝑆 = 32 may be the right compromise between performance and mapping cost. It can also be seen that the uniform and stress workloads are less sensitive to the value of 𝑆. This is because more balanced workloads have more balanced mapping-index distributions, and thus fewer catch-up remappings even when 𝑆 is small. The stress workload sees some small utilization decrease at 𝑆 = 128, which can be attributed to the fact that 𝜙 is optimized for the 1-LLA workload. In a real
[Figure 5 (five bar-chart panels): utilization of ECC-map, Sec-Mem, RBSG and SG on the 1-LLA, Stress, Uniform and Zipf workloads, for (a) 𝑁/𝑤𝑚𝑎𝑥 = 0.5, (b) 𝑁/𝑤𝑚𝑎𝑥 = 1, (c) 𝑁/𝑤𝑚𝑎𝑥 = 2, (d) 𝑁/𝑤𝑚𝑎𝑥 = 4, (e) 𝑁/𝑤𝑚𝑎𝑥 = 8.]
Figure 5: Utilization as a function of the size-to-endurance ratio.

Workload   Device size 𝑁   𝑤𝑚𝑎𝑥   𝜙     Logical utilization (1)   Host writes   Physical writes
1-LLA      1024            128    96    0.61                      80540         108779.8
           4096            512    384   0.61                      1281893.6     1777136
           16384           2048   1536  0.61                      20443371      28573809.8
Uniform    1024            128    96    0.65                      85005.2       100765.4
           4096            512    384   0.65                      1368310.4     1702714.2
           16384           2048   1536  0.65                      21910283.2    28316701.2
Stress     1024            128    96    0.73                      95844.6       115641.8
           4096            512    384   0.74                      1559924.4     1952942.6
           16384           2048   1536  0.75                      24888046.8    31827576
Zipf       1024            128    96    0.55                      71901.2       86187.2
           4096            512    384   0.56                      1184265.8     1495514.2
           16384           2048   1536  0.54                      18242031.2    23847574.6
Table 2: Comparing detailed run statistics (averaged) for three different device sizes with the same ratio 𝑁/𝑤𝑚𝑎𝑥 = 8.
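The four workloads driving Table 2 (and the rest of the evaluation) can be sketched as simple address generators. This is an illustrative Python sketch only; the function name and generator structure are our own, and just the workload definitions (a single random LLA, a uniform 3% subset, uniform over the full space, and the Zipfian 𝑝(𝑖, 𝐾)) follow the text.

```python
import random

def make_workload_generator(kind, K, seed=0):
    """Return a zero-argument function producing one LLA per host write.
    kind is one of "1-LLA", "stress", "uniform", "zipf"; K is the number of LLAs."""
    rng = random.Random(seed)
    if kind == "1-LLA":
        lla = rng.randrange(K)              # one randomly chosen LLA, written repeatedly
        return lambda: lla
    if kind == "stress":
        subset = rng.sample(range(K), max(1, (3 * K) // 100))  # 3% of the LLAs
        return lambda: rng.choice(subset)   # uniform selection within the subset
    if kind == "uniform":
        return lambda: rng.randrange(K)     # uniform over the entire address space
    if kind == "zipf":
        # p(i, K) = (1/i) / sum_{n=1}^{K} (1/n), for LLA sequence number i = 1..K
        weights = [1.0 / i for i in range(1, K + 1)]
        return lambda: rng.choices(range(K), weights=weights)[0]
    raise ValueError(kind)
```

Note that `random.choices` normalizes the weights internally, so the harmonic-sum denominator of 𝑝(𝑖, 𝐾) need not be computed explicitly.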

implementation of the architecture, one may choose to set 𝜙 to jointly optimize for different workloads, but good utilization is achieved even with the simple formula used in this work.

3.2.3 Dependence on the remapping threshold 𝜙. To expand on the issue of optimizing the threshold 𝜙, we point back to Figure 5 and note that for 𝑁/𝑤𝑚𝑎𝑥 = 0.5, the value of the fractional ratio 𝜙/𝑤𝑚𝑎𝑥 according to (4) is as high as 1 − 1/64. Such high thresholds, while optimal for the 1-LLA workload, limit the performance on the Zipf workload. Thus, a possible solution is to set 𝜙/𝑤𝑚𝑎𝑥 as the minimum between the outcome of (4) and a predefined limit, e.g., 0.8. The effect of this is demonstrated in Figure 7, comparing the performance of ECC-Map with and without this modification. It can be seen that the Zipf utilization increased from around 0.4 (as also seen in Figure 5a) to over 0.7. At the same time, the modification did decrease the 1-LLA (and stress) utilizations, but not significantly so.
To validate the correctness of the 𝜙opt derived in (4), we next want to see the utilization as a function of 𝜙. We define 𝜙opt ≜ 𝛼opt · 𝑤𝑚𝑎𝑥, and for convenience plot in Figure 8 the utilization as a function of 𝜙/𝜙opt − 1. The x-axis point 0 represents the value of 𝜙 = 𝜙opt
[Figure 6: utilization of ECC-Map on the 1-LLA, Uniform and Stress workloads for window sizes 𝑆 = 16, 32, 64, 128, with the RBSG 1-LLA utilization (RBSG-Advers) shown as a horizontal reference line.]
Figure 6: Utilization as a function of the mapping window size 𝑆.

[Figure 7: utilization on the 1-LLA, Stress, Uniform and Zipf workloads, with and without the capped threshold.]
Figure 7: ECC-map with 𝜙 = 𝜙opt vs. with 𝜙 = min(𝜙opt, 0.8𝑤𝑚𝑎𝑥).

[Figure 8: utilization of ECC-Map on the 1-LLA, Uniform, Stress and Zipf workloads as 𝜙/𝜙opt − 1 is varied from −0.3 to 0.1.]
Figure 8: Utilization as the threshold-trigger 𝜙 is varied from its optimized value.

[Figure 9: utilization on the four workloads for spare factors 10%, 15%, 20%, 25%.]
Figure 9: Utilization for different values of spare factor.

as specified by (4), where the values to the left and right of that point represent smaller and bigger 𝜙 values, respectively. One can see that utilization on the 1-LLA (and stress) workload is maximized close to the value of 0, indicating the correctness of the analysis leading to the specified value 𝜙opt. Further, it is seen that a small increase in 𝜙 may harm utilization significantly (for all workloads except the uniform), while a decrease is largely harmless and even helpful for the Zipf workload. For good performance in all workloads, this plot motivates adopting the point of 𝜙/𝜙opt = 0.8, which we also use in the next sub-section. The plot also gives an important insight for design purposes: since one may have only an estimate of the line wear, it is important that it will be an overestimate, such that harmless premature remappings are favored over late ones that decrease utilization.

3.2.4 Dependence on the spare factor. Finally, in Figure 9 we plot the utilization for four different values of spare factor 𝜌 = 0.1, 0.15, 0.2, 0.25. 𝑁 is fixed to its default value, and 𝐾 is varied to get the corresponding spare factors. Recall from Section 2.3.2 that using spare PLAs helps having non-colliding regular remappings, which reduces the amount of internal-copy writes and slows down the advancement of mapping indices. Firstly, the plot shows that even with a spare factor as small as 0.1 the utilization is significant. That said, increasing it to 0.15 gives a substantial advantage in the more challenging workloads of 1-LLA and Zipf. Increasing it further exhibits diminishing returns. We reiterate the fact that the utilization metric takes into account the added cost of the spare PLAs, thus giving a fair comparison across different spare factors. The fact that the utilization is normalized by 𝑁 means that it is not necessarily increasing with the spare factor, as seen in the decreasing trend for the uniform workload. This decrease can be attributed to the growing number of remappings needed to claim unused endurance in more spare PLAs.
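The collision model behind this discussion (detailed in the appendix) is easy to check numerically: with 𝐾 = (1 − 𝜌)𝑁 occupied PLAs and remapping targets modeled as independent uniform draws, a regular remapping collides with probability 𝐾/𝑁 = 1 − 𝜌, and the number of draws until a free PLA is found is geometric with mean 1/𝜌. The following Monte Carlo sketch is our own illustrative code (not the paper's simulator):

```python
import random

def spare_factor_model(N, rho, trials=40000, seed=1):
    """Estimate (collision probability, mean draws to find a free PLA)
    when a fraction 1 - rho of the N PLAs is occupied and remapping
    targets are modeled as independent uniform draws over the PLAs."""
    rng = random.Random(seed)
    K = int((1 - rho) * N)           # occupied PLAs; N - K are spare/free
    collisions, draws_total = 0, 0
    for _ in range(trials):
        draws = 1
        while rng.randrange(N) < K:  # model: PLAs 0..K-1 are occupied
            draws += 1               # collision -> evict and try the next index
        collisions += 1 if draws > 1 else 0
        draws_total += draws
    return collisions / trials, draws_total / trials

p_collide, mean_draws = spare_factor_model(1024, 0.2)
# expected: p_collide close to 1 - 0.2 = 0.8, mean_draws close to 1/0.2 = 5
```

This also illustrates why 1/𝜌 ≪ 𝑆 is needed: the mean number of draws grows as 1/𝜌, and when it approaches the window size 𝑆 the costly catch-up remappings become frequent.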
4 IMPLEMENTATION CONSIDERATIONS
Toward using the ECC-Map architecture in a real memory device, this section expands upon important details needed for efficient implementation. Primarily, it discusses ways by which the window-size parameter 𝑆 translates to bounded implementation complexity.

4.1 Achieving Low-Complexity Mapping
The basic requirement from the architecture to support read and write operations is to be able to map any LLA to its current PLA. In ECC-Map, this requirement is reduced to mapping an LLA to its current mapping index, thence finding the PLA by simple invocation of the mapping function. Using a global register for the current base mapping index, this requirement is further reduced to mapping an LLA to its offset from the base index – see (3) in Section 2. This last reduction is the key promise toward achieving efficient mapping: representing the offset index only requires log2 𝑆 bits (rounded upward to the next integer), while representing the PLA or the full mapping index requires log2 𝑁 bits. Recall that 𝑁 is a device parameter that grows with the scaling of the memory size, while 𝑆 is a small fixed constant. For example, in a 500GB memory device with a line size of 512B, we have 𝑁 ≈ 2³⁰, while the "recommended" window-size parameter in Section 3 is 𝑆 = 32 = 2⁵.
In addition to the forward (LLA→PLA) mapping, the remapping operations described in Section 2.3 require inverse-mapping a PLA to its mapping index. The simplest and lowest-cost way to support inverse mapping is by storing the offset index on the line itself on the media, alongside the data. The cost is negligible when the line size is much larger than log2 𝑆. Next, we describe three possible implementation approaches of the forward mapping in ECC-Map.

4.1.1 Reduced-size mapping table. The straightforward way to implement forward mapping in the proposed architecture is through a table mapping each LLA to its offset mapping index, see Figure 10a. This already gives a major advantage over full-indirection mapping: for example, in a 500GB device with line size 512B and 𝑆 = 32 this saves 1 − log2 𝑆/log2 𝑁 = 83.25% of the table memory cost, where 𝑁 = 500𝑒9/512.

4.1.2 Advanced mapping data structures. The fact that 𝑆 is a relatively small constant opens the way to devising clever data structures for mapping LLAs to offsets, which will be more economical than a table. Such data structures are beyond the scope of this work, but we give a simple example case to clarify this direction. Suppose that the base mapping index is defined to be the default, and the mapping data structure only needs to record LLAs having other indices. Then we need a data structure that records membership in the remaining 𝑆 − 1 offsets, and whose size is proportional to the number of LLAs not using the default index. Keeping this data-structure size under the constraints of the pre-allocated memory is an interesting optimization problem. In unbalanced workloads (such as 1-LLA and stress), we naturally get that the majority of LLAs use the same mapping index, and in more balanced workloads we can afford triggering special catch-up remappings for consolidating the mapping data structure.

4.1.3 Mapping with no data structure. It is possible to implement the forward mapping without using any data structure, except for the global base register. Recall that for the inverse mapping we store in every PLA its offset index. Thus, given an LLA, we can find all 𝑆 PLAs that may be mapped to it for the current base, and read from each one the log2 𝑆 inverse-mapping bits. The 𝑆 PLAs are {𝑓base(𝐿𝐿𝐴), 𝑓base+1(𝐿𝐿𝐴), . . . , 𝑓base+𝑆−1(𝐿𝐿𝐴)}, and it can be verified¹ that there is exactly one 𝑃𝐿𝐴 = 𝑓𝑖(𝐿𝐿𝐴) whose stored inverse-mapping offset bits represent the index 𝑖 according to (3). This matching PLA is found as the current mapping of the LLA. See Figure 10b. While this method requires accessing multiple physical locations on the memory media, only a small number 𝑆 log2 𝑆 of bits are read in total, and this operation may also be parallelized depending on the memory technology.

¹ Otherwise violating the proven property that an LLA is mapped to exactly one PLA.

Figure 10: (a) The straightforward forward-mapping implementation: reduced-size mapping table, example for 𝑆 = 4. A global base register, and log2 𝑆 offset bits for each LLA. (b) Most economical forward-mapping implementation: mapping with no data structure, example for 𝑆 = 4. First calculating the PLAs {𝑓base+𝑖(𝐿𝐿𝐴)} for 𝑖 = 0, . . . , 3, then reading from the media the PLAs' offset values, and choosing the one that matches the offset used in the forward mapping (the PLA marked in green).

4.2 Efficient LFSR Transformation
Recall from Section 2.4 that the running index 𝑖 calculated in (3) undergoes an LFSR transformation before entering the index fields of the mapping functions in Figure 1a and Figure 1b. To perform these transformations, the device has to have efficient access to the values 𝐿𝐹𝑆𝑅(𝑖), for 𝑖 = base, . . . , base + 𝑆 − 1. This can be achieved either by maintaining a cache holding these 𝑆 values, or by efficient cycling of the LFSR back and forth in this range.

5 RELATED WORK
Wear leveling is a key problem in designing and deploying wear-limited memories, and has thus attracted considerable prior research attention. The problem initially arose in Flash-based memories, and was later considered for phase-change memories (PCM) and related emerging technologies for persistent memories. In Flash memory the problem is somewhat simplified, at least from the mapping perspective, thanks to the common employment of full indirection in the flash translation layer (FTL). In persistent memories based on PCM (and related) technologies, full indirection is not a likely option, due to the lack of need for out-of-place writing, and the smaller line sizes.
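The data-structure-free lookup of Section 4.1.3 can be made concrete with a short sketch. The real 𝑓𝑖 in ECC-Map comes from an ECC encoder; here a toy affine function stands in, chosen only to preserve the property that each LLA matches exactly one probed PLA. All names (`host_write`, `host_read`, `media`, `BASE`) are our own illustrative choices.

```python
N, S, BASE = 64, 4, 10   # toy device: N PLAs, window size S, global base register

def f(i, lla):
    # Toy stand-in for the ECC-based mapping functions: invertible in lla,
    # and distinct indices i map a given lla to distinct PLAs.
    return (lla + 7 * i) % N

media = {}               # media[pla] = (data, offset_bits): log2(S) bits per line

def host_write(lla, offset, data):
    # The line stores its own offset index alongside the data (inverse mapping).
    media[f(BASE + offset, lla)] = (data, offset)

def host_read(lla):
    # Forward mapping with no data structure: probe the S candidate PLAs
    # and keep the unique one whose stored offset matches the probe offset.
    for offset in range(S):
        pla = f(BASE + offset, lla)
        if pla in media and media[pla][1] == offset:
            return media[pla][0]
    return None          # LLA not written yet
```

Only 𝑆 · log2 𝑆 inverse-mapping bits are touched per lookup, matching the cost argument of Section 4.1.3.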
Prior wear-leveling solutions, for both Flash and PCM devices, use a variety of techniques, at different parts of the memory stack: from the operating system to the physical representation of cell levels. We now briefly mention a non-exhaustive sample of these techniques.

5.1 Wear-leveling and Related Techniques for PCM Devices
The most celebrated PCM wear-leveling architecture is Start-Gap [14], thanks to its simplicity and extremely efficient mapping layer. The key technique used in Start-Gap [14] is periodical line shifting, called therein gap movements. In addition, it proposes to divide the device into regions to better mitigate extremely unbalanced workloads (though at the cost of unused endurance in some regions). In [26] and [15], line shifting is complemented by region swapping for improved wear spreading. While region swapping helps, its flexibility for claiming unused endurance depends on the region size, and fine region partition requires large mapping tables. Region swapping is further enhanced in [27] by considering endurance variation among different regions for selecting the swap target. Exploiting variation is a useful technique, complementary to the design of the mapping architecture, and it can also enhance the proposed ECC-Map architecture. Another technique used in almost all wear-leveling architectures is address randomization for hiding the mapping from an adversary, as we implement here in ECC-Map.
Additional works address wear leveling as part of larger architectural settings, building on the techniques mentioned in the previous paragraph. [3–5] incorporate wear leveling into the operating-system stack; [24, 25] combine PCM and DRAM (the latter having much higher endurance); and [23] proposes a novel hardware address decoder (PRAD) that can help in wear leveling (among other things).
A vastly studied approach, related to wear leveling, is wear reduction. [6] proposes a physical-writing mechanism for PCM that reduces the write wear. [10] uses information from the L1 cache to write only modified data to the PCM media. A similar objective is pursued in [8], which in addition presents a wear-leveling scheme for PCM when it acts as a cache. Wear-reduction techniques are extremely useful in practice, and can similarly enhance the performance of ECC-Map.

5.2 Wear-leveling Techniques for Flash Devices
Flash-based memories differ from PCM and newer persistent memories in their internal structure of large update units (called blocks), each comprising many lines (known as pages). Due to this structure, Flash wear leveling is done at the larger block granularity, assuming the availability of a translation layer supporting flexible logical-to-physical mapping. Most of the techniques use a table that tracks the wear of each data block, hence picking low-wear blocks for the incoming host writes. [16] considers the endurance variability among different blocks, and tabulates block-reliability statistics based on measuring the program error rate. ECC-Map can also be extended to consider variability, by setting variable 𝜙 thresholds for different parts of the device. [19] further extends the reliability estimation by considering retention errors through time measurements between consecutive program cycles. [22] suggests using multiple block remapping thresholds, reducing the number of writes between remappings as the device ages.

6 CONCLUSION
In this work, we present ECC-Map, a novel wear-leveling scheme for persistent memories that can handle even the most unbalanced workloads. A family of efficient functions based on ECC encoders provides flexible and economical mapping, and enables remapping operations that are more targeted to the incident workload. ECC-Map's remapping algorithms are extremely simple, which is important for implementation on device controllers. Toward that, many interesting topics are left for future work. Among them: 1) the organization of the mapping meta-data on the memory media, 2) the optimization and scheduling of remapping operations, and 3) further improvements to the proposed mapping algorithms, offering interesting tradeoffs among different workloads.

7 ACKNOWLEDGEMENT
This work was supported in part by the Israel Science Foundation under grant number 2525/19.

REFERENCES
[1] Dmytro Apalkov, Alexey Khvalkovskiy, Steven Watts, Vladimir Nikitin, Xueti Tang, Daniel Lottis, Kiseok Moon, Xiao Luo, Eugene Chen, and Adrian Ong. Spin-transfer torque magnetic random access memory (STT-MRAM). ACM Journal on Emerging Technologies in Computing Systems (JETC), 9(2):1–35, 2013.
[2] Paul H. Bardell, William H. McAnney, and Jacob Savir. Built-in Test for VLSI: Pseudorandom Techniques. Wiley-Interscience, 1987.
[3] Yu-Ming Chang, Pi-Cheng Hsiu, Yuan-Hao Chang, Chi-Hao Chen, Tei-Wei Kuo, and Cheng-Yuan Michael Wang. Improving PCM Endurance with a Constant-Cost Wear Leveling Design. ACM Trans. Des. Autom. Electron. Syst., 22(1), jun 2016.
[4] Chi-Hao Chen, Pi-Cheng Hsiu, Tei-Wei Kuo, Chia-Lin Yang, and Cheng-Yuan Michael Wang. Age-Based PCM Wear Leveling with Nearly Zero Search Cost. In Proceedings of the 49th Annual Design Automation Conference, DAC '12, page 453–458, New York, NY, USA, 2012. Association for Computing Machinery.
[5] Sheng-Wei Cheng, Yuan-Hao Chang, Tseng-Yi Chen, Yu-Fen Chang, Hsin-Wen Wei, and Wei-Kuan Shih. Efficient Warranty-Aware Wear Leveling for Embedded Systems With PCM Main Memory. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(7):2535–2547, 2016.
[6] Sangyeun Cho and Hyunjin Lee. Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy and Endurance. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, page 347–357, New York, NY, USA, 2009. Association for Computing Machinery.
[7] George C. Clark and J. Bibb Cain. Error-correction coding for digital communications. Plenum Press, New York, 1981.
[8] Yongsoo Joo, Dimin Niu, Xiangyu Dong, Guangyu Sun, Naehyuck Chang, and Yuan Xie. Energy- and endurance-aware design of phase change memory caches. In 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), pages 136–141. IEEE, 2010.
[9] Miguel Angel Lastras-Montano and Kwang-Ting Cheng. Resistive random-access memory based on ratioed memristors. Nature Electronics, 1(8):466–472, 2018.
[10] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting Phase Change Memory as a Scalable DRAM Alternative. SIGARCH Comput. Archit. News, 37(3):2–13, jun 2009.
[11] Dongzhe Ma, Jianhua Feng, and Guoliang Li. A survey of address translation technologies for flash memories. ACM Comput. Surv., 46(3), jan 2014.
[12] Frederic P. Miller, Agnes F. Vandome, and John McBrewster. Cyclic Redundancy Check: Computation of CRC, Mathematics of CRC, Error Detection and Correction, Cyclic Code, List of Hash Functions, Parity Bit, Information ... Cksum, Adler-32, Fletcher's Checksum. Alpha Press, 2009.
[13] Ardavan Pedram, Stephen Richardson, Mark Horowitz, Sameh Galal, and Shahar Kvatinsky. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era. IEEE Design & Test, 34(2):39–50, 2017.
[14] Moinuddin K. Qureshi, John Karidis, Michele Franceschini, Vijayalakshmi Srinivasan, Luis Lastras, and Bulent Abali. Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling. In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 14–23, 2009.
[15] Andre Seznec. A Phase Change Memory as a Secure Main Memory. IEEE
Computer Architecture Letters, 9(1):5–8, 2010.
[16] Xin Shi, Fei Wu, Shunzhuo Wang, Changsheng Xie, and Zhonghai Lu. Program
error rate-based wear leveling for NAND flash memory. In 2018 Design, Automa-
tion & Test in Europe Conference & Exhibition (DATE), pages 1241–1246. IEEE,
2018.
[17] Tae-Sun Chung, Dong-Joo Park, Sangwon Park, Dong-Ho Lee, Sang-Won Lee, and Ha-Joo Song. A survey of flash translation layer. Journal of Systems Architecture, 55(5):332–343, 2009.
[18] Thomas E. Tkacik. A hardware random number generator. In Burton S. Kaliski,
Çetin K. Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded
Systems - CHES 2002, pages 450–453, Berlin, Heidelberg, 2003. Springer Berlin
Heidelberg.
[19] Debao Wei, Liyan Qiao, Xiaoyu Chen, Mengqi Hao, and Xiyuan Peng. SREA: A
self-recovery effect aware wear-leveling strategy for the reliability extension of
NAND flash memory. Microelectronics Reliability, 100-101:113433, 2019. 30th
European Symposium on Reliability of Electron Devices, Failure Physics and
Analysis.
[20] H-S Philip Wong, Simone Raoux, SangBum Kim, Jiale Liang, John P Reifenberg,
Bipin Rajendran, Mehdi Asheghi, and Kenneth E Goodson. Phase change memory.
Proceedings of the IEEE, 98(12):2201–2227, 2010.
[21] Ming-Chang Yang, Yu-Ming Chang, Che-Wei Tsao, Po-Chun Huang, Yuan-Hao
Chang, and Tei-Wei Kuo. Garbage collection and wear leveling for flash memory:
Past and future. In 2014 International Conference on Smart Computing, pages
66–73, 2014.
[22] Yuan Hua Yang, Xian Bin Xu, Shui Bing He, Fang Zhen, and Yu Ping Zhang.
WLVT: A Static Wear-Leveling Algorithm with Variable Threshold. In Advanced
Materials Research, volume 756, pages 3131–3135. Trans Tech Publ, 2013.
[23] Leonid Yavits, Lois Orosa, Suyash Mahar, João Dinis Ferreira, Mattan Erez, Ran
Ginosar, and Onur Mutlu. WoLFRaM: Enhancing Wear-Leveling and Fault
Tolerance in Resistive Memories using Programmable Address Decoders. In 2020
IEEE 38th International Conference on Computer Design (ICCD), pages 187–196,
2020.
[24] HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael A. Harding, and
Onur Mutlu. Row buffer locality aware caching policies for hybrid memories.
In 2012 IEEE 30th International Conference on Computer Design (ICCD), pages
337–344, 2012.
[25] Wangyuan Zhang and Tao Li. Characterizing and mitigating the impact of
process variations on phase change based memory systems. In 2009 42nd Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 2–13,
2009.
[26] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A Durable and Energy Efficient
Main Memory Using Phase Change Memory Technology. SIGARCH Comput.
Archit. News, 37(3):14–23, jun 2009.
[27] Wen Zhou, Dan Feng, Yu Hua, Jingning Liu, Fangting Huang, and Pengfei Zuo.
Increasing lifetime and security of phase-change memory with endurance varia-
tion. In 2016 IEEE 22nd International conference on parallel and distributed systems
(ICPADS), pages 861–868. IEEE, 2016.
APPENDIX / SUPPLEMENTARY MATERIAL

Discussion of the Overall Device Operation
As detailed in Section 2.3, host writes go directly to their mapped PLAs so long as the PLA's write count is below 𝜙, and trigger remapping otherwise. That means that a PLA can reach its end-of-life of 𝑤𝑚𝑎𝑥 writes only during a remapping operation. There are two factors by which the utilization (1) is degraded from its theoretical limit of 1: 1) remaining unused writes of PLAs, and 2) internal-copy writes during remapping. There is at most one internal-copy write in a regular remapping operation: one when the remapping is colliding, and none when it is non-colliding. Considering 𝑓𝑖(·) as a random function whose output is uniformly drawn from {0, . . . , 𝑁 − 1}, the probability that a regular remapping is colliding equals 𝐾/𝑁 = 1 − 𝜌. Hence the number of internal-copy writes can be reduced by increasing the spare factor 𝜌. In a colliding regular remapping, the mapping index of the evicted 𝐿𝐿𝐴′ is incremented 𝛿 times before a free PLA is found. With the same randomness assumption above, 𝛿 is geometrically distributed with parameter 𝜌 and mean 1/𝜌. That means we need to choose 𝜌 such that 1/𝜌 ≪ 𝑆; otherwise, colliding regular remappings will frequently consume the entire window of 𝑆 indices, invoking a costly catch-up remapping. Sample parameters that satisfy this requirement are 𝜌 = 0.15 and 𝑆 = 64 ≫ 6.66.

Performance Estimate for the 1-LLA Workload
For the 1-LLA workload and the catch-up algorithm we choose in this work (see Section 2.3.3), the last of every 𝑆 consecutive regular remappings will invoke a catch-up remapping. Thus the proposed architecture can level the 1-LLA writes across the entire space of 𝑁 PLAs, costing 𝑁/𝑆 catch-up remappings in total. Each catch-up remapping costs on average 𝐾/𝑁 internal-copy writes per PLA, giving on average (𝑁/𝑆) · (𝐾/𝑁) = 𝐾/𝑆 internal-copy writes per PLA in the device lifetime. When 𝐾/𝑆 is not a large fraction of 𝑤𝑚𝑎𝑥, the proposed architecture will be able to reach a high value of utilization.

Calculating the Trigger Threshold
Recall from Section 2.3.1 that 𝜙 is the wear threshold above which the PLA only serves remapping writes, and no direct host writes. This immediately gives the criterion by which we need to set 𝜙: the remaining endurance of 𝑤𝑚𝑎𝑥 − 𝜙 writes should suffice for serving all future remapping writes into this PLA. If 𝜙 is set too high such that this condition is not met, a remapping operation will cause a PLA to exceed 𝑤𝑚𝑎𝑥 writes, ending the device lifetime prematurely before claiming all unused endurance. With this condition in mind, we want to set 𝜙 as high as possible to maximize the number of host writes served by each PLA.

Deriving the Optimal 𝜙 for the 1-LLA Workload
We denote 𝛼 ≜ 𝜙/𝑤𝑚𝑎𝑥 as the fractional threshold, and seek to find the optimal 𝛼. In the following quantitative discussion we assume the 1-LLA workload, and consider only the internal-copy writes of catch-up remappings (the other internal-copy writes, during regular remappings, are negligible in number). According to the condition on 𝜙 stated in the previous subsection, setting a fractional threshold of 𝛼 leaves sufficient endurance for (1 − 𝛼)𝑤𝑚𝑎𝑥 catch-up remappings. Between the 𝑡-th and (𝑡 + 1)-th catch-up remappings, 𝑆 PLAs each serve 𝛼𝑤𝑚𝑎𝑥 − 𝑡 host writes (𝑡 writes are consumed by the previous catch-up remappings). In other words, the number of host writes per PLA is decreasing linearly as a function of the catch-up number with slope −1; this is shown graphically in Figure 11. The total number of host writes served by the device equals the area of the trapezoid in the figure, multiplied by the constant 𝑆. Thus the optimal 𝛼 maximizes the expression for this area, given by

(𝛼𝑤𝑚𝑎𝑥 + (2𝛼 − 1)𝑤𝑚𝑎𝑥) · (1 − 𝛼)𝑤𝑚𝑎𝑥 / 2.    (5)

Taking the derivative and equating to zero, we get 𝛼* = 2/3. Note that this optimal 𝛼* applies so long as (1 − 𝛼*)𝑤𝑚𝑎𝑥 ≤ 𝑁/𝑆, because the right-hand side is an upper bound on the number of catch-up remappings until utilizing the entire device. This is equivalent to the condition 𝑁/(𝑆𝑤𝑚𝑎𝑥) ≥ 1/3. For the complement case 𝑁/(𝑆𝑤𝑚𝑎𝑥) < 1/3, the x-axis of Figure 11 can reach the maximal number of catch-up remappings 𝑁/𝑆 with 𝛼 ≤ 1 − 𝑁/(𝑆𝑤𝑚𝑎𝑥), so setting 𝛼 = 1 − 𝑁/(𝑆𝑤𝑚𝑎𝑥) maximizes the total number of host writes. We summarize the optimal values of 𝛼 in the following equation:

𝛼opt = 1 − 𝑁/(𝑆𝑤𝑚𝑎𝑥) if 𝑁/(𝑆𝑤𝑚𝑎𝑥) < 1/3, and 𝛼opt = 2/3 otherwise.    (6)

We use this 𝛼opt to set the trigger threshold in our evaluations in Section 3. Note that the first case of (6) is the more favourable one that fully utilizes the 𝑁 PLAs of the device. As we increase 𝑆, we remain in this favourable case for larger 𝑁/𝑤𝑚𝑎𝑥 ratios.

[Figure 11: a trapezoid decreasing linearly from 𝛼 · 𝑤𝑚𝑎𝑥 host writes per PLA at catch-up number 0 down to (2𝛼 − 1) · 𝑤𝑚𝑎𝑥 at catch-up number (1 − 𝛼) · 𝑤𝑚𝑎𝑥.]
Figure 11: Host writes per PLA vs. the catch-up number.

Proof of Property 2
If both 𝑖 and 𝑗 are in the range [0, . . . , 𝑁 − 1], all the non-zeros in their binary representations are confined to the 𝑚 = log2 𝑁 right-most bits of the index field. If both 𝑖, 𝑗 map the same 𝐿𝐿𝐴 to the same 𝑃𝐿𝐴, then both [𝐿𝐿𝐴|𝑖|𝑃𝐿𝐴] and [𝐿𝐿𝐴|𝑗|𝑃𝐿𝐴] must be codewords. When subtracting (modulo 2) these codewords, we get a third codeword all of whose non-zeros are confined to 𝑚 or fewer consecutive coordinates, which is a contradiction when the code is cyclic with redundancy 𝑟 = 𝑚.
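The optimization of (5) can also be verified numerically. The following sketch (our own check, with an arbitrary 𝑤𝑚𝑎𝑥) performs a grid search over 𝛼 and confirms the analytic maximizer 𝛼* = 2/3 for the regime 𝑁/(𝑆𝑤𝑚𝑎𝑥) ≥ 1/3:

```python
def trapezoid_area(alpha, w_max):
    # Equation (5): (alpha*w + (2*alpha - 1)*w) * (1 - alpha)*w / 2
    return (alpha * w_max + (2 * alpha - 1) * w_max) * (1 - alpha) * w_max / 2

w_max = 4096
alphas = [a / 1000 for a in range(1001)]          # grid over [0, 1]
best = max(alphas, key=lambda a: trapezoid_area(a, w_max))
# best lies within grid resolution of the analytic optimum 2/3
```

In the complement regime 𝑁/(𝑆𝑤𝑚𝑎𝑥) < 1/3, the same grid search constrained to 𝛼 ≤ 1 − 𝑁/(𝑆𝑤𝑚𝑎𝑥) reproduces the first case of (6).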