ECC-Map: A Resilient Wear-Leveled Memory-Device Architecture With Low Mapping Overhead
LLAs and PLAs, such that even if LLA write loads are extremely unbalanced, no
PLA will exceed its endurance limit 𝑤𝑚𝑎𝑥 prematurely. In addition to mapping
flexibility, the architecture needs to specify the wear-leveling algorithms
governing the evolution of this mapping over the device lifetime, and the
internal data movements.
A well-known wear-leveling architecture, called Start-Gap (SG) [14], uses a
very economical mapping structure and allows basic address remappings called
gap movements. The gap is a spare PLA, and its movement is done by writing
into it the LLA residing in the adjacent PLA. While SG has demonstrated good
wear-leveling performance on some natural workloads, in extremely unbalanced
workloads its performance is not satisfactory. In particular, when 𝑤𝑚𝑎𝑥 is
not orders of magnitude larger than 𝑁, SG fails to adequately level an
adversarial workload that continuously writes to a single LLA, which we call
the 1-LLA workload in this paper. This is a major potential impediment in
practice given the earlier stated trend of growing 𝑁 and shrinking 𝑤𝑚𝑎𝑥.
Some mitigations have been proposed for this undesired behavior, but none of
them fully solves the problem. Dividing the device into regions is a common,
useful proposition made in [14, 15, 26] (and others); but with the blessing of
breaking 𝑁 into small “mini-devices” comes the curse (and complexity) of
simultaneously managing many such mini-devices. Other proposed approaches to
deal with extremely unbalanced workloads have been to cache those writes in an
unlimited-endurance medium, or better yet, to block the writes from the device
by caching them “elsewhere”. While these may work in specific system settings,
a stand-alone persistent-memory device can assume neither of the two, and must
guarantee adequate wear leveling on its own.
The new device architecture we present in this paper aims to solve the
wear-leveling problem by enhancing the flexibility of the LLA-PLA mapping,
while keeping the costs associated with this new mapping small and controlled.
The fundamental problem of wear-leveling mapping architectures is that they
must be able to map frequently-written LLAs flexibly across the PLA space, and
tracking a flexible workload-dependent mapping costs memory and processing
resources. If any LLA can be heavily written, resulting in its repeated
remapping within the entire PLA space, it appears intuitively necessary for
the device to keep for each LLA its current full PLA address. Storing and
maintaining such a mapping entails high cost (memory space) and complexity
(persisting and/or wear-leveling meta-data), which we would like to avoid for
commercial viability. Fortunately, contradicting this intuition, we show in
the sequel a mapping architecture that is able to relocate any LLA across the
entire PLA space, without maintaining a costly LLA-to-PLA mapping.
The main idea of ECC-Map is to use a family of mapping functions to maintain
the LLA-PLA mapping. A large family of functions gives more flexibility than
the simple functions used in prior work, and at the same time more efficiency
than offered by a mapping table in memory. Around these mapping functions we
design the entire mapping architecture and its algorithms, which we summarize
in the following by stating ECC-Map's main ingredients.
(1) A family of efficiently computable mapping functions with properties
    allowing effective reclaiming of unused wear. Each member of the family is
    defined by an integer mapping index.
(2) A sliding window bounding the range of mapping indices used throughout the
    device at a given time. The window size controls the mapping complexity.
(3) Selective remapping of specific logical addresses from their current
    physical locations to a new location determined by a subsequent mapping
    index.
(4) A remapping trigger invoked when a physical location reaches a designated
    wear threshold based on either a write-count estimate or a reliability
    estimate.
(5) Mapping-index randomization to prevent an adversary from tracking mapping
    pairs and generating harmful adaptive workloads.
Ingredients 1-4 are new, to the best of our knowledge, while ingredient 5 is a
commonly used security measure.
We provide the details of the ECC-Map architecture in Section 2, and in
Section 3 we present the performance evaluation of ECC-Map using a device
discrete-event simulation. The results show that high device utilizations can
be reached even for 𝑁/𝑤𝑚𝑎𝑥 ratios that fail prior wear-leveling
architectures. The promising results of the modeled device in the simulation
environment motivate the future implementation of ECC-Map in a hardware device
within a working computing system. For that, in Section 4 we discuss finer
implementation details toward a more practical realization of the architecture
in hardware.

2 THE MAPPING ARCHITECTURE
It is imperative upon a memory device to maximize its ability to serve host
writes before reaching its physical endurance limits. In this work, we assume
an endurance model whereby each PLA is limited to 𝑤𝑚𝑎𝑥 total physical writes,
and once that limit is exceeded, it cannot be used anymore for read or write.
To guarantee full usability, we define the device's lifetime as the time until
any PLA exceeds 𝑤𝑚𝑎𝑥 writes. Hence, the key performance objective in this
work is to maximize the total number of host writes served by the device in
its lifetime. Let 𝑁 be the number of PLAs in the device, that is, its physical
capacity in units of lines. Then 𝑤𝑚𝑎𝑥 · 𝑁 is a fundamental upper bound on the
number of host writes served within the device lifetime. With respect to this
fundamental bound, we define the writing utilization as

    𝑈𝑡𝑖𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛 = #𝐻𝑜𝑠𝑡 𝑤𝑟𝑖𝑡𝑒𝑠 / (𝑤𝑚𝑎𝑥 · 𝑁),        (1)

which corresponds to a specific write workload served by the device. The
utilization is a number between 0 and 1, the higher the better. Note that the
utilization measure accounts for the full storage cost 𝑁, even in the case of
over-provisioned physical storage 𝑁 > 𝐾, where 𝐾 is the number of LLAs.
Furthermore, the numerator counts host writes and not physical writes (the
latter include internal writes by the mapping layer), and thus (1) captures
the true utility offered by the device to the customer. In contrast, some
prior works (e.g., [14]) use performance measures that count the total number
of physical writes (including internal writes), and are thus not valid in
cases of significant write amplification. Later in the paper we evaluate the
utilization for several important workloads, focusing primarily on notoriously
challenging workloads.
Improving the utilization by wear leveling is made possible by implementing a
mapping layer that spreads the uneven LLA access more evenly across the PLA
space. The mapping layer maintains a dynamic mapping function between the 𝐾
LLAs and the 𝑁 PLAs; in general 𝑁 ≥ 𝐾, and we define 𝜌 = (𝑁 − 𝐾)/𝑁 ≥ 0 as
the spare factor of the device. At any point in time, the mapping function
needs to be injective, that is, not mapping multiple LLAs to the same PLA. The
function need not be surjective, that is, not all PLAs need to map to LLAs.
The simplest but most costly implementation of the mapping function is by a
mapping table having an entry for each LLA storing its mapped PLA. A mapping
table can wear level effectively, but its hardware and maintenance costs are
prohibitive. Much more efficient mapping layers use a global efficiently
computable function: 𝑃𝐿𝐴 = 𝑓 (𝐿𝐿𝐴), where 𝑓 (·) changes in time, and storing
its specification can be done with little memory. The key of the proposed
mapping architecture of this work is to extend this from a single function
𝑓 (·) to a family of functions {𝑓𝑖 (·)}, allowing different LLAs to be mapped
using different mapping indices 𝑖. The combined mapping function, which maps
every LLA to the PLA output by the function 𝑓𝑖 (·) designated for it, needs to
be an injective mapping at every given time. Adding the index to the mapping
function improves its flexibility to level the wear, while bounding the
mapping cost is possible by limiting the range of indices used throughout the
LLA space at any given time.

Figure 1: (a) Forward mapping. The input LLA and the function index (in white)
comprise the input to the ECC encoder. The encoder's output (shaded orange)
gives the PLA resulting from the mapping. (b) Inverse mapping. PLA and LLA
exchange roles: PLA is now part of the input (in white) and LLA is the output
(shaded blue). Since this layout is a cyclic shift from the forward mapping,
the exact same encoder function can be used.

2.1 Implementation of the Mapping Functions
To implement the family of functions, we use encoding functions of cyclic
error-correcting codes (ECC), used elsewhere for error correction and
detection, including as cyclic redundancy check (CRC) [12] codes. We choose
these functions for the several advantages they offer: 1) efficient hardware
implementation, 2) simple reverse mapping, and 3) spreading an LLA mapping
across the entire PLA space (as we detail later). The third feature is
critical for obtaining high utilization in adversarial workloads, and is in
general not satisfied by alternative options such as cryptographic
pseudo-random permutations. Common cryptographic functions also require a
significantly higher computation load relative to cyclic ECC encoding.
Assuming 𝑁 (the number of PLAs) is an integer power of 2, we define an integer
parameter 𝑚 = log2 𝑁. For the function family we take a binary cyclic ECC
with parameters [𝑛, 𝑘], where 𝑛 is the codeword length and 𝑘 is the number of
information bits input to the encoder. 𝑟 = 𝑛 − 𝑘 is the redundancy of the
code, and the code is specified by a binary generator polynomial of degree 𝑟.
A convenient source for such codes is the family of primitive BCH codes that
exist for a rich variety of [𝑛, 𝑘] combinations; some sample generator
polynomials of BCH codes can be found in [7]. We choose a code with 𝑟 = 𝑚 and
𝑘 ≥ 2𝑚. The input to the encoder is the binary vector [𝐿𝐿𝐴|𝑖], where |
represents concatenation and 𝑖 is the mapping index. 𝐿𝐿𝐴 is represented as an
𝑚-bit vector and 𝑖 as a (𝑘 − 𝑚)-bit vector, both using the standard binary
representation. The encoding is depicted in Figure 1a. The output of the
encoder is an 𝑟 = 𝑚-bit representation of the output 𝑃𝐿𝐴. Using the encoder
as specified, we get the following.

Property 1. 𝑓𝑖 (·) is an injective function for every index 𝑖.

Property 1 is proven by contradiction: assume there are two different LLAs
mapping to the same PLA for the same index. Then by subtracting the
corresponding two codewords (modulo 2), we get (from linearity) a third
codeword all of whose non-zeros are confined to 𝑚 or fewer consecutive
coordinates, which contradicts a known property of cyclic codes with
redundancy 𝑟 = 𝑚.
The inverse mapping (from PLA and index to LLA) is shown in Figure 1b: the
input is [𝑖 |𝑃𝐿𝐴] and the output is 𝐿𝐿𝐴. From the cyclic property of the code
and the fact that Figure 1b is obtained from Figure 1a by a cyclic shift of 𝑚
positions to the left, the inverse mapping can use the same encoding function
used by the forward mapping, but note the required reordering of the input
arguments [𝑖 |𝑃𝐿𝐴] vs. [𝐿𝐿𝐴|𝑖]. This family of functions also enjoys the
following very useful property (proved in the Appendix).

Property 2. For any 0 ≤ 𝑖 < 𝑗 < 𝑁, 𝑓𝑖 (𝐿𝐿𝐴) ≠ 𝑓𝑗 (𝐿𝐿𝐴) for every 𝐿𝐿𝐴.

The importance of Property 2 is that a single LLA does not return to the same
PLA before reaching index 𝑖 = 𝑁. This allows the wear-leveling scheme using
this mapping to utilize the endurance of all 𝑁 PLAs even if a single LLA is
written by the host.
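To make the encoder-based mapping concrete, here is a minimal Python sketch of
the forward and inverse mappings of Figure 1 for a toy configuration chosen
purely for illustration: the [15, 11] binary Hamming code (a primitive BCH
code) with generator polynomial g(x) = x^4 + x + 1, giving m = r = 4 (so
N = 16 PLAs) and k − m = 7 index bits. The constant and function names are
ours, not part of the architecture specification.

M = 4                       # r = m = log2(N) parity bits, so N = 16 PLAs
K = 11                      # information bits of the [15, 11] code: [LLA | i]
GEN = 0b10011               # generator polynomial g(x) = x^4 + x + 1

def _parity(data, width):
    # Systematic cyclic encoding: remainder of data(x) * x^m divided by g(x) over GF(2).
    value = data << M
    for bit in range(width + M - 1, M - 1, -1):
        if value & (1 << bit):
            value ^= GEN << (bit - M)
    return value            # the m-bit remainder

def f_forward(lla, i):
    # PLA = f_i(LLA): encode the k-bit vector [LLA | i] and keep the m parity bits.
    return _parity((lla << (K - M)) | i, K)

def f_inverse(pla, i):
    # Recover LLA from (i, PLA): encode [i | PLA], the m-left cyclic shift of Figure 1a.
    return _parity((i << M) | pla, K)

lla, i = 0b1010, 5
pla = f_forward(lla, i)
assert f_inverse(pla, i) == lla                                # same encoder inverts
assert len({f_forward(lla, j) for j in range(2**M)}) == 2**M   # Property 2 for j < N

The last assertion exercises Property 2 for this toy code: all N = 16 mapping
indices send the same LLA to distinct PLAs.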
2.2 Sliding Window of Mapping Indices
The large number (≥ 𝑁) of mapping indices supported by the proposed mapping
functions is clearly useful for effective spreading of the write load
throughout the entire device. However, toward limiting the resources consumed
by the mapping, we restrict all LLAs to have mapping indices in a subset of 𝑆
consecutive indices. This subset changes as a sliding window throughout the
device lifetime. That is, the mapping-index set is
{base, base + 1, . . . , base + 𝑆 − 1}, for some integer base, and every LLA
has a mapping index

    𝑖 = base + offset𝑖 ,        (2)

where offset𝑖 ∈ {0, . . . , 𝑆 − 1}. This allows the mapping architecture to
keep the value of base in a global register, and represent the index 𝑖
compactly as offset𝑖 , which takes only log2 𝑆 bits. Figure 2a depicts this
restriction of indices to a sliding window of size 𝑆 = 4. 𝑆 is a design
parameter of the architecture: large 𝑆 allows more flexibility for selective
remapping, but also increases the complexity and/or costs of maintaining the
mapping. We defer to Section 4 the detailed discussion of the effect of 𝑆 on
mapping complexity.
In the meantime, we note a practical disadvantage of using offset𝑖 (as in (2))
for representing the index, due to the need to update offset𝑖 when the window
slides to a subsequent base, even if 𝑖 is unchanged. Instead, we propose an
alternative compact representation: 𝑖¯ = 𝑖 mod 𝑆, which can be used to recover
𝑖 using the formula

    𝑖 = base + ((𝑖¯ − base) mod 𝑆).        (3)

It is clear that 𝑖¯ remains unchanged if 𝑖 is unchanged, even if base is
increased. For example, an LLA mapped with 𝑖 = 3 in Figure 2a has 𝑖¯ = 3,
which remains the same even after the window movement to base = 1 in
Figure 2b. In this latter state, 𝑖 can be recovered by (3):
𝑖 = 1 + ((3 − 1) mod 4) = 3.

Figure 2: Sliding window of active mapping indices, for the case 𝑆 = 4.
(a) Initial set in shaded orange when base = 0. (b) After incrementing to
base = 1, the set moves to the window in shaded blue.
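As a minimal illustration of this bookkeeping (names are ours), each LLA
stores only the log2 𝑆-bit value 𝑖¯, and the full index is recovered from the
global base register by formula (3):

S = 4                                  # window size, as in Figure 2

def compact(i):
    return i % S                       # i-bar: the only per-LLA index state

def recover(i_bar, base):
    return base + ((i_bar - base) % S)   # formula (3)

i = 3
i_bar = compact(i)                     # 3; unchanged when the window slides
assert recover(i_bar, base=0) == 3     # window of Figure 2a
assert recover(i_bar, base=1) == 3     # window of Figure 2b, same i recovered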
2.3 Selective Remapping
The unique feature of the proposed architecture is that different LLAs can be
mapped by different mapping functions (mapping indices). This feature allows
selective remapping of heavily written LLAs by incrementing their mapping
index, while keeping other LLAs at their current mapping indices and physical
locations. Thanks to the device over-provisioning (𝑁 > 𝐾), it is possible to
remap an LLA with minimal change to the mapping of other LLAs. A heavier
remapping operation, called catch-up, occurs when the remapped LLA's mapping
index is incremented beyond the current index window. In this case, the index
window needs to shift, and with it will move all the LLAs that are currently
mapped by indices below its new base index. However, since 𝑆 ≫ 1, catch-up
events are a minuscule minority of the remapping events. We give some more
details on the remapping operations, starting with how remapping events are
triggered.

2.3.1 Remapping trigger. Selective remapping warrants the definition of a
trigger event for moving a specific LLA from its current PLA. A global write
counter, used in most prior wear-leveling architectures (e.g., [14]), would
not suffice in this case. Informally, we remap an LLA written by the host if
its current PLA has reached a wear level that holds the risk of its premature
failing. Toward this end, we specify a wear threshold 𝜙 < 𝑤𝑚𝑎𝑥, with the
following policy: a host write to an LLA mapped to a PLA that has exceeded 𝜙
writes will be written after remapping the LLA to a different PLA. Note that
the policy is only applied to host writes; a write that is part of a remapping
operation will not trigger an additional remapping, even if the written PLA
exceeded the threshold. This differentiation makes the remapping procedures
(discussed next) simpler and more deterministic. The value of 𝜙 is an
optimization variable, set to vacate a worn PLA “just in time” to keep it
usable for all future remappings. In Section 3 we specify a formula for the
value of 𝜙, derived based on an analysis of ECC-Map that is included in the
appendix. Although 𝜙 is given as a count of physical writes, implementing the
remapping trigger does not require maintaining PLA write counters (which would
be expensive). Instead, reaching a threshold of 𝜙 can be detected by a
reliability measurement of the PLA, for example by counting the number of bit
errors corrected by the decoder of the data error-correcting codes.

2.3.2 Regular remapping. The vast majority of remapping operations follow the
simple procedure we describe next. When 𝐿𝐿𝐴 triggers remapping, it is moved
from mapping index 𝑖 to 𝑖 + 1. If 𝑓𝑖+1 (𝐿𝐿𝐴) is an unused PLA, the remapping
is complete – we call this a non-colliding regular remapping. In a colliding
regular remapping, before writing 𝐿𝐿𝐴 to 𝑓𝑖+1 (𝐿𝐿𝐴), the 𝐿𝐿𝐴′ currently
mapped to this PLA with index 𝑗 is remapped to index 𝑗 + 𝛿, where 𝛿 is the
smallest positive integer such that 𝑓𝑗+𝛿 (𝐿𝐿𝐴′) is an unused PLA. The
procedure guarantees the movement of 𝐿𝐿𝐴 to index 𝑖 + 1, and in case of
collision, moves 𝐿𝐿𝐴′ out to a free PLA. The rationale behind giving 𝐿𝐿𝐴
priority over 𝐿𝐿𝐴′ is to minimize the index increase of host-written LLAs,
thus allowing more writes before an LLA exits the index window.

2.3.3 Catch-up remapping. When an LLA needs to move during regular remapping
to an index equal to or greater than base + 𝑆, a catch-up procedure is
invoked. In the catch-up procedure we first set a new base value greater than
the current one, and then remap every LLA with index smaller than the new base
to an index greater than or equal to it. The amount by which base is shifted,
as well as the new indices chosen for the catching-up LLAs, are a matter for
optimization. In this work we simply set base ← base + 𝑆, and remap all LLAs
with smaller indices to the new base index. Other catch-up algorithms (not
used in this work) may alternatively add less than 𝑆 to base, and/or move LLAs
to indices strictly beyond the new base.
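The sketch below is our own simplified rendering of the two remapping
procedures, not the authors' controller code. It assumes abstract
forward/inverse mapping functions (for example, the ones sketched in
Section 2.1) and omits wear accounting, the case where the pushed-out LLA
itself leaves the window, and collisions among LLAs moved during catch-up.

from dataclasses import dataclass, field
from typing import Callable, Dict, Set

@dataclass
class MapState:
    f: Callable        # forward mapping f(lla, i) -> pla
    f_inv: Callable    # inverse mapping f_inv(pla, i) -> lla
    S: int             # window size
    base: int = 0
    index: Dict[int, int] = field(default_factory=dict)  # per-LLA mapping index
    used: Set[int] = field(default_factory=set)          # occupied PLAs

def regular_remap(lla, st):
    # Move the triggering LLA from index i to i + 1 (Section 2.3.2).
    st.used.discard(st.f(lla, st.index[lla]))
    st.index[lla] += 1
    if st.index[lla] >= st.base + st.S:        # index left the window: slide it first
        catch_up(st)
    target = st.f(lla, st.index[lla])
    if target in st.used:                      # colliding case: push LLA' forward
        lla2 = occupant_of(target, st, exclude=lla)
        st.used.discard(target)
        delta = 1                              # smallest delta reaching a free PLA
        while st.f(lla2, st.index[lla2] + delta) in st.used:
            delta += 1
        st.index[lla2] += delta                # (may itself need a catch-up; omitted)
        st.used.add(st.f(lla2, st.index[lla2]))
    st.used.add(target)

def occupant_of(pla, st, exclude):
    # One possible use of the inverse mapping: test each index of the current window.
    for j in range(st.base, st.base + st.S):
        cand = st.f_inv(pla, j)
        if cand != exclude and st.index.get(cand) == j:
            return cand

def catch_up(st):
    # Simple variant used in this work: base <- base + S, lagging LLAs move to the
    # new base (collisions among the moved LLAs are ignored in this sketch).
    st.base += st.S
    for x, idx in st.index.items():
        if idx < st.base:
            st.used.discard(st.f(x, idx))
            st.index[x] = st.base
            st.used.add(st.f(x, st.base))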
Figure 3a and Figure 3b illustrate the operations of regular remapping
described above. The plotted arrays represent the PLA space of the device, and
the capital letters in the array are the LLAs mapped to the corresponding
PLAs. Initially, LLA 𝐴 is mapped by index 𝑖 as shown at the top part of
Figure 3a. Upon its triggered remapping, 𝐴 moves to its PLA position at the
bottom part by incrementing its index to 𝑖 + 1. In this remapping there is no
collision with another LLA. Figure 3b shows the next remapping of 𝐴, in which
there is a collision with LLA 𝐹 ; to free the PLA for 𝐴, 𝐹 is moved to the PLA
mapped to it by index 𝑗 + 3, because the lower indices 𝑗 + 1 and 𝑗 + 2 map to
used PLAs.
Figure 4 illustrates the catch-up procedure described above. For 𝑆 = 4, the
figure displays the mapping index of each LLA. Initially, base = 0 and 𝐴 has
index 3 (top part). Upon remapping of 𝐴 (bottom part), its index is
incremented to 4, which falls outside the current window
{base = 0, 1, 2, 3 = 𝑆 − 1}, thus invoking catch-up. This example implements
the simple algorithm we use in this work: setting base ← base + 𝑆 = 4, and
updating all LLAs to the new base.
Figure 3: Illustration of regular remapping. (a) The non-colliding case. In
this case, 𝐴 moves from index 𝑖 to 𝑖 + 1 and reaches a free PLA. (b) The
colliding case. In this case, the next index 𝑖 + 2 maps 𝐴 to a used PLA, thus
requiring the movement of 𝐹 to a subsequent index that maps it to a free PLA.

Figure 4: Illustration of the catch-up procedure. Initially, base = 0 and all
LLAs have indices in the range {0, 1, 2, 3}. Then 𝐴 is remapped to index 4,
invoking a catch-up procedure leading to base = 4 and all LLAs mapped with
index 4.

The final ingredient of the proposed mapping architecture is index
randomization, applied for the purpose of hiding the instantaneous mapping
functions from an adversary generating the write workload.

2.4 Mapping-index Randomization
As we make the standard assumption that the mapping functions used by the
architecture are publicly known, an adversary may be able to track the mapping
of LLAs and issue writes to those mapped to high-wear PLAs. To prevent this,
we add a pseudo-random transformation between the running mapping indices
1, 2, 3, . . . , 𝑁 − 1 and the actual mapping numbers fed to the mapping
functions (forward and inverse) of Figure 1a and Figure 1b.
For the transformation we use a standard linear-feedback shift register (LFSR)
[2, 18], which is initialized to a random seed generated internally by the
device. The random seed is the output

3 EVALUATION AND RESULTS
Before evaluating the proposed mapping architecture, we specify the formula we
use to set the threshold parameter 𝜙 (see Section 2.3.1), as a function of the
architecture parameters 𝑁, 𝑤𝑚𝑎𝑥, 𝑆. The formula is based on a theoretical
analysis of ECC-Map for the 1-LLA workload that repeatedly writes to a single
LLA until reaching the device end of life. The detailed analysis can be found
in the appendix. We denote 𝛼 ≜ 𝜙/𝑤𝑚𝑎𝑥 as the fractional threshold, and set 𝛼
to be the following
    𝛼opt = 1 − 𝑁/(𝑆𝑤𝑚𝑎𝑥)   if 𝑁/𝑤𝑚𝑎𝑥 < 𝑆/3;   𝛼opt = 2/3   otherwise.        (4)
The subscript “opt” is used to mark the fact that this 𝛼 maximizes the
utilization on the 1-LLA workload, according to the model and its analysis in
the appendix.

3.1 Evaluation
Implementation. To evaluate the performance of the proposed architecture, we
implemented all of its ingredients in a Python-based discrete-event simulator.
The simulator accepts an arbitrary write workload, and runs it through the
proposed device mapping layer, including exact management of the indexed
mapping functions, and performing all remappings (regular and catch-up). Upon
reaching the device end of life, that is, when a PLA first exceeds 𝑤𝑚𝑎𝑥
writes, the simulator stops and records the utilization value for this
workload. Using a software simulator allows us to examine the device
performance over a large variety and broad range of system variables.
The principal system variable in our evaluation, which turns out to be the key
performance determinant, is the size-to-endurance ratio 𝑁/𝑤𝑚𝑎𝑥. In general,
as 𝑁/𝑤𝑚𝑎𝑥 grows, the more “difficult” it becomes for a given mapping
architecture to level the wear. This fact is observed in [14], with the
inequality 1/𝜓 > 𝑁/𝑤𝑚𝑎𝑥, where the left-hand side is the frequency of
internal-copy writes required by the SG architecture. The same ratio 𝑁/𝑤𝑚𝑎𝑥
also appears (twice) in (4). The principal dependence on the ratio 𝑁/𝑤𝑚𝑎𝑥
allows us to use relatively small values in most tests (𝑁 = 1024, varying
𝑤𝑚𝑎𝑥), significantly speeding up the evaluation. To prove that the absolute
values of 𝑁, 𝑤𝑚𝑎𝑥 are secondary to their ratio, our evaluations include tests
we repeat with 4 and 16 times larger 𝑁, 𝑤𝑚𝑎𝑥, showing no significant
difference. We thus expect that much larger commercial devices with these
size-to-endurance ratios – for example, an 𝑁/𝑤𝑚𝑎𝑥 = 2 device with 2G lines
and endurance 1e9 – will perform similarly.
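As a rough, hypothetical illustration of the evaluation loop described above
(the interface names lookup/remap are ours, not the simulator's), the
following sketch drives a workload through a mapping layer until the first PLA
exceeds 𝑤𝑚𝑎𝑥 and reports the utilization of Equation (1):

def run_until_end_of_life(workload, mapping, N, w_max, phi):
    # 'workload' yields LLAs; 'mapping' is a hypothetical mapping-layer object exposing
    # lookup(lla) -> pla and remap(lla) -> iterable of PLAs written internally.
    wear = [0] * N                              # physical writes per PLA
    host_writes = 0
    for lla in workload:
        if wear[mapping.lookup(lla)] > phi:     # remapping trigger of Section 2.3.1
            for p in mapping.remap(lla):        # regular or catch-up remapping
                wear[p] += 1                    # internal writes also consume wear
                if wear[p] > w_max:
                    return host_writes / (w_max * N)
        pla = mapping.lookup(lla)
        wear[pla] += 1                          # the host write itself
        host_writes += 1
        if wear[pla] > w_max:                   # end of life: a PLA exceeded w_max
            return host_writes / (w_max * N)    # utilization of Equation (1)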
In addition to 𝑁/𝑤𝑚𝑎𝑥, other system variables we examine in our evaluation
are the window size 𝑆, the spare factor 𝜌, and the
trigger threshold 𝜙. For convenience, we list in Table 1 the default values we
use for these system variables, unless noted otherwise (each result typically
varies one variable, leaving the rest at their default values).

    system variable                         default value
    Size-to-endurance ratio 𝑁/𝑤𝑚𝑎𝑥         0.5
    Window size 𝑆                           32
    Spare factor 𝜌                          20%
    Trigger threshold 𝜙                     set by (4) to 𝛼opt · 𝑤𝑚𝑎𝑥
Table 1: Default values of system variables.

Comparison. In addition to studying the performance of the proposed
architecture, this section compares this performance to three state-of-the-art
wear-leveling architectures for PCM: the SG and RBSG architectures [14], where
the latter adds region partitioning to the former, and the region-based
secure-PCM main-memory architecture [15] (Sec-Mem). The last of the three
works by dynamically remapping a full region every certain number of host
writes to it since the last remapping. Note that dynamic region mapping
requires a mapping table with size linear in the number of regions. While
there are follow-up works enhancing these architectures in different ways,
these three works are the best known for the standard device model we consider
here. Therefore, we expect to see a similar advantage over other variants,
which also use global or region write counters as a trigger for remapping. For
the SG architecture we use a device with the same logical capacity 𝐾, and a
single spare PLA, as specified in [14]; for RBSG we use the same 𝐾 and 𝑁
parameters as ECC-Map. The secure PCM architecture does not need spare, thus
it is used with 𝑁 = 𝐾 (and the same 𝐾). Note that the comparison is fair even
if the values of 𝑁 are not equal, because the utilization metric penalizes the
increased 𝑁 appropriately.
Measurements. We ran different write workloads, and for each architecture we
counted the total number of writes (host and physical) the device served until
its end of life. We recorded several performance metrics: the (logical)
utilization (1), the number of host writes, and the number of physical writes.
We repeated each test five times and averaged the results to smooth out the
workload randomness.
Workloads. We tested four write workloads in our evaluations: 1) the 1-LLA
workload, 2) the stress workload, 3) the uniform workload, and 4) the Zipfian
distribution Zipf. In 1), we randomly choose a single LLA, and write to it
repeatedly until reaching end of life. In 2), we randomly pick a 3% fraction
of the LLAs and write only to them, where the selection within the set is
uniform. In 3), each write draws an LLA uniformly from the entire space. In
4), each write draws an LLA from the whole address space with a non-uniform
selection that follows the distribution 𝑝(𝑖, 𝐾) = (1/𝑖) / (∑𝑛=1..𝐾 1/𝑛),
where 𝑖 is the LLA's sequence number and 𝐾 is the number of LLAs. The 1-LLA
workload is the key motivation of this work, hence it will be the focus of the
evaluation. The stress and Zipf workloads model other challenging write
patterns that the device needs to handle, and the “easier” uniform workload is
included mainly as a reference, since it is handled well by prior
wear-leveling architectures.
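For concreteness, the snippet below sketches one plausible way to generate the
four workloads; the generator names and RNG choices are ours, as the paper
does not specify an implementation.

import random

def one_lla(K, rng=random):
    hot = rng.randrange(K)                     # a single randomly chosen LLA
    while True:
        yield hot

def stress(K, frac=0.03, rng=random):
    hot_set = rng.sample(range(K), max(1, int(frac * K)))   # 3% of the LLAs
    while True:
        yield rng.choice(hot_set)              # uniform selection within the hot set

def uniform(K, rng=random):
    while True:
        yield rng.randrange(K)

def zipf(K, rng=random):
    # p(i, K) = (1/i) / sum_{n=1..K} 1/n, with i the LLA's sequence number (1..K)
    weights = [1.0 / i for i in range(1, K + 1)]
    while True:
        yield rng.choices(range(K), weights=weights)[0]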
3.2 Results
We first use the default values from Table 1 and plot in Figure 5a the
utilizations of the four architectures for each of the four workloads. It is
first observed that ECC-Map significantly outperforms the three prior
architectures on the 1-LLA workload. RBSG's performance is satisfactory only
on the uniform workload, and SG's is lower than ECC-Map's on all workloads
except Zipf, on which it is very close to ECC-Map. Sec-Mem's performance on
the stress workload is about 1/3 worse than ECC-Map's. On the uniform workload
all architectures have good performance, as expected, with Sec-Mem slightly
ahead and ECC-Map coming second. We continue the experiment with progressively
larger size-to-endurance ratios 𝑁/𝑤𝑚𝑎𝑥. Recall from the discussion in
Section 3.1 that larger ratios are in general more difficult to wear-level.
Moving from the default value of 0.5 in Figure 5a, we plot in Figures 5b-5e
the utilizations for four larger values of 𝑁/𝑤𝑚𝑎𝑥, each time multiplying it
by 2. We get these ratios by fixing 𝑁 and halving 𝑤𝑚𝑎𝑥 successively. The
value of 𝜙 is calculated using (4) for each tested ratio. We indeed see that
increasing the size-to-endurance ratio decreases the utilizations on the 1-LLA
and stress workloads. However, this decrease is much more graceful in ECC-Map
than in Sec-Mem, while both SG and RBSG have near-zero utilizations on these
workloads starting from 𝑁/𝑤𝑚𝑎𝑥 = 2. The performance of ECC-Map on the Zipf
workload even improves with larger size-to-endurance ratios, for reasons that
will be explained later in Section 3.2.3. The very low utilization values of
both SG/RBSG and Sec-Mem throughout Figure 5 mean that these architectures
cannot be used by devices with such size-to-endurance ratios.

3.2.1 Dependence on the absolute device size. In the next experiment, we
examine the (in)sensitivity of the results to the absolute values of 𝑁, 𝑤𝑚𝑎𝑥,
thus corroborating our claim that performance is determined by their ratio
𝑁/𝑤𝑚𝑎𝑥. Toward that, we ran the same workloads with the same ratios and three
different device sizes 𝑁 = 1024, 4096, 16384 (with corresponding 𝑤𝑚𝑎𝑥
values). The results are recorded in Table 2, showing almost identical values
of (logical) utilization between the different sizes.

3.2.2 Dependence on the window size 𝑆. In Figure 6, we plot the utilization's
dependence on the window-size parameter 𝑆. Recall that 𝑆 controls the mapping
richness/complexity, so it is important to examine its effect on performance.
We ran the workloads using the default device parameters, each time
implementing a different window size 𝑆 = 16, 32, 64, 128. First, the results
show that for all values of 𝑆 and all workloads, the architecture achieves
significant utilization (for comparison, we plot the RBSG 1-LLA utilization as
a horizontal line). It can be observed that a significant improvement is
offered to the 1-LLA workload when increasing 𝑆 from 16 to 32, while
subsequent increases give more modest advantages. That means that for these
device parameters, 𝑆 = 32 may be the right compromise between performance and
mapping cost. It can also be seen that the uniform and stress workloads are
less sensitive to the value of 𝑆. This is because more balanced workloads have
more balanced mapping-index distributions, and thus fewer catch-up remappings
even when 𝑆 is small. The stress workload sees some small utilization decrease
at 𝑆 = 128, which can be attributed to the fact that 𝜙 is optimized for the
1-LLA workload.
In a real implementation of the architecture, one may choose to set 𝜙 to
jointly optimize for different workloads, but good utilization is achieved
even with the simple formula used in this work.

[Figure 5: utilization for the 1-LLA, Stress, Uniform, and Zipf workloads;
panels (a)-(e) show 𝑁/𝑤𝑚𝑎𝑥 = 0.5, 1, 2, 4, 8.]

3.2.3 Dependence on the remapping threshold 𝜙. To expand on the issue of
optimizing the threshold 𝜙, we point back to Figure 5 and note that for
𝑁/𝑤𝑚𝑎𝑥 = 0.5, the value of the fractional ratio 𝜙/𝑤𝑚𝑎𝑥 according to (4) is
as high as 1 − 1/64. Such high thresholds, while optimal for the 1-LLA
workload, limit the performance on the Zipf workload. Thus, a possible
solution is to set 𝜙/𝑤𝑚𝑎𝑥 as the minimum between the outcome of (4) and a
predefined limit, e.g., 0.8. The effect of this is demonstrated in Figure 7,
comparing the performance of ECC-Map with and without this modification. It
can be seen that the Zipf utilization increased from around 0.4 (as also seen
in Figure 5a) to over 0.7. At the same time, the modification did decrease the
1-LLA (and stress) utilizations, but not significantly so.
To validate the correctness of the 𝜙opt derived in (4), we next want to see
the utilization as a function of 𝜙. We define 𝜙opt ≜ 𝛼opt · 𝑤𝑚𝑎𝑥, and for
convenience plot in Figure 8 the utilization as a function of 𝜙/𝜙opt − 1. The
x-axis point 0 represents the value of 𝜙 = 𝜙opt
Figure 6: Utilization as a function of the mapping window size 𝑆.

[Figure 7: utilization on the 1-LLA, Stress, Uniform, and Zipf workloads with
the trigger threshold set to 𝜙opt vs. min(𝜙opt, 0.8𝑤𝑚𝑎𝑥).]

Figure 8: Utilization as the threshold trigger 𝜙 is varied from its optimized
value (x-axis: 𝜙/𝜙opt − 1).
5 RELATED WORK
Prior wear-leveling solutions, for both Flash and PCM devices, use a variety
of techniques, at different parts of the memory stack: from the operating
system to the physical representation of cell levels. We now briefly mention a
non-exhaustive sample of these techniques.

5.1 Wear-leveling and Related Techniques for PCM Devices
The most celebrated PCM wear-leveling architecture is Start-Gap [14], thanks
to its simplicity and extremely efficient mapping layer. The key technique
used in Start-Gap [14] is periodical line shifting, called therein gap
movements. In addition, it proposes to divide the device into regions to
better mitigate extremely unbalanced workloads (though at the cost of unused
endurance in some regions). In [26] and [15], line shifting is complemented by
region swapping for improved wear spreading. While region swapping helps, its
flexibility for reclaiming unused endurance depends on the region size, and
fine region partition requires large mapping tables. Region swapping is
further enhanced in [27] by considering endurance variation among different
regions for selecting the swap target. Exploiting variation is a useful
technique, complementary to the design of the mapping architecture, and can
also enhance the proposed ECC-Map architecture. Another technique used in
almost all wear-leveling architectures is address randomization for hiding the
mapping from an adversary, as we implement here in ECC-Map.
Additional works address wear leveling as part of larger architectural
settings, building on the techniques mentioned in the previous paragraph.
[3-5] incorporate wear leveling into the operating-system stack; [24, 25]
combine PCM and DRAM (the latter having much higher endurance); and [23]
proposes a novel hardware address decoder (PRAD) that can help in wear
leveling (among other things).
A vastly studied approach, related to wear leveling, is wear reduction. [6]
proposes a physical-writing mechanism for PCM that reduces the write wear.
[10] uses information from the L1 cache to write only modified data to the PCM
media. A similar objective is pursued in [8], which in addition presents a
wear-leveling scheme for PCM when it acts as a cache. Wear reduction
techniques are extremely useful in practice, and can similarly enhance the
performance of ECC-Map.

5.2 Wear-leveling Techniques for Flash Devices
Flash-based memories differ from PCM and newer persistent memories in their
internal structure of large update units (called blocks), each comprising many
lines (known as pages). Due to this structure, Flash wear leveling is done at
the larger block granularity, assuming the availability of a translation layer
supporting flexible logical-to-physical mapping. Most of the techniques use a
table that tracks the wear of each data block, hence picking low-wear blocks
for the incoming host writes. [16] considers the endurance variability among
different blocks, and tabulates block reliability statistics based on
measuring program error rate. ECC-Map can also be extended to consider
variability, by setting variable 𝜙 thresholds for different parts of the
device. [19] further extends the reliability estimation by considering
retention errors through time measurements between consecutive program cycles.
[22] suggests using multiple block remapping thresholds, reducing the number
of writes between remappings as the device ages.

6 CONCLUSION
In this work, we present ECC-Map, a novel wear-leveling scheme for persistent
memories that can handle even the most unbalanced workloads. A family of
efficient functions based on ECC encoders provides flexible and economical
mapping, and enables remapping operations that are more targeted to the
incident workload. ECC-Map's remapping algorithms are extremely simple, which
is important for implementation on device controllers. Toward that, many
interesting topics are left for future work. Among them: 1) the organization
of the mapping meta-data on the memory media, 2) the optimization and
scheduling of remapping operations, and 3) further improvements to the
proposed mapping algorithms, offering interesting tradeoffs among different
workloads.

7 ACKNOWLEDGEMENT
This work was supported in part by the Israel Science Foundation under grant
number 2525/19.

REFERENCES
[1] Dmytro Apalkov, Alexey Khvalkovskiy, Steven Watts, Vladimir Nikitin, Xueti
    Tang, Daniel Lottis, Kiseok Moon, Xiao Luo, Eugene Chen, and Adrian Ong.
    Spin-transfer torque magnetic random access memory (STT-MRAM). ACM Journal
    on Emerging Technologies in Computing Systems (JETC), 9(2):1-35, 2013.
[2] Paul H. Bardell, William H. McAnney, and Jacob Savir. Built-in Test for
    VLSI: Pseudorandom Techniques. Wiley-Interscience, 1987.
[3] Yu-Ming Chang, Pi-Cheng Hsiu, Yuan-Hao Chang, Chi-Hao Chen, Tei-Wei Kuo,
    and Cheng-Yuan Michael Wang. Improving PCM Endurance with a Constant-Cost
    Wear Leveling Design. ACM Trans. Des. Autom. Electron. Syst., 22(1), June
    2016.
[4] Chi-Hao Chen, Pi-Cheng Hsiu, Tei-Wei Kuo, Chia-Lin Yang, and Cheng-Yuan
    Michael Wang. Age-Based PCM Wear Leveling with Nearly Zero Search Cost. In
    Proceedings of the 49th Annual Design Automation Conference, DAC '12,
    pages 453-458, New York, NY, USA, 2012. Association for Computing
    Machinery.
[5] Sheng-Wei Cheng, Yuan-Hao Chang, Tseng-Yi Chen, Yu-Fen Chang, Hsin-Wen
    Wei, and Wei-Kuan Shih. Efficient Warranty-Aware Wear Leveling for
    Embedded Systems With PCM Main Memory. IEEE Transactions on Very Large
    Scale Integration (VLSI) Systems, 24(7):2535-2547, 2016.
[6] Sangyeun Cho and Hyunjin Lee. Flip-N-Write: A Simple Deterministic
    Technique to Improve PRAM Write Performance, Energy and Endurance. In
    Proceedings of the 42nd Annual IEEE/ACM International Symposium on
    Microarchitecture, MICRO 42, pages 347-357, New York, NY, USA, 2009.
    Association for Computing Machinery.
[7] George C. Clark and J. Bibb Cain. Error-Correction Coding for Digital
    Communications. Plenum Press, New York, 1981.
[8] Yongsoo Joo, Dimin Niu, Xiangyu Dong, Guangyu Sun, Naehyuck Chang, and
    Yuan Xie. Energy- and endurance-aware design of phase change memory
    caches. In 2010 Design, Automation & Test in Europe Conference &
    Exhibition (DATE 2010), pages 136-141. IEEE, 2010.
[9] Miguel Angel Lastras-Montano and Kwang-Ting Cheng. Resistive random-access
    memory based on ratioed memristors. Nature Electronics, 1(8):466-472,
    2018.
[10] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting
    Phase Change Memory as a Scalable DRAM Alternative. SIGARCH Comput.
    Archit. News, 37(3):2-13, June 2009.
[11] Dongzhe Ma, Jianhua Feng, and Guoliang Li. A survey of address
    translation technologies for flash memories. ACM Comput. Surv., 46(3),
    January 2014.
[12] Frederic P. Miller, Agnes F. Vandome, and John McBrewster. Cyclic
    Redundancy Check: Computation of CRC, Mathematics of CRC, Error Detection
    and Correction, Cyclic Code, List of Hash Functions, Parity Bit,
    Information ... Cksum, Adler-32, Fletcher's Checksum. Alpha Press, 2009.
[13] Ardavan Pedram, Stephen Richardson, Mark Horowitz, Sameh Galal, and
    Shahar Kvatinsky. Dark Memory and Accelerator-Rich System Optimization in
    the Dark Silicon Era. IEEE Design & Test, 34(2):39-50, 2017.
[14] Moinuddin K. Qureshi, John Karidis, Michele Franceschini, Vijayalakshmi
    Srinivasan, Luis Lastras, and Bulent Abali. Enhancing Lifetime and
    Security of PCM-based Main Memory with Start-Gap Wear Leveling. In 2009
    42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),
    pages 14-23, 2009.
[15] Andre Seznec. A Phase Change Memory as a Secure Main Memory. IEEE
    Computer Architecture Letters, 9(1):5-8, 2010.
[16] Xin Shi, Fei Wu, Shunzhuo Wang, Changsheng Xie, and Zhonghai Lu. Program
    error rate-based wear leveling for NAND flash memory. In 2018 Design,
    Automation & Test in Europe Conference & Exhibition (DATE), pages
    1241-1246. IEEE, 2018.
[17] Tae-Sun Chung, Dong-Joo Park, Sangwon Park, Dong-Ho Lee, Sang-Won Lee,
    and Ha-Joo Song. A survey of flash translation layer. Journal of Systems
    Architecture, 55(5):332-343, 2009.
[18] Thomas E. Tkacik. A hardware random number generator. In Burton S.
    Kaliski, Çetin K. Koç, and Christof Paar, editors, Cryptographic Hardware
    and Embedded Systems - CHES 2002, pages 450-453, Berlin, Heidelberg, 2003.
    Springer Berlin Heidelberg.
[19] Debao Wei, Liyan Qiao, Xiaoyu Chen, Mengqi Hao, and Xiyuan Peng. SREA: A
    self-recovery effect aware wear-leveling strategy for the reliability
    extension of NAND flash memory. Microelectronics Reliability,
    100-101:113433, 2019. 30th European Symposium on Reliability of Electron
    Devices, Failure Physics and Analysis.
[20] H.-S. Philip Wong, Simone Raoux, SangBum Kim, Jiale Liang, John P.
    Reifenberg, Bipin Rajendran, Mehdi Asheghi, and Kenneth E. Goodson. Phase
    change memory. Proceedings of the IEEE, 98(12):2201-2227, 2010.
[21] Ming-Chang Yang, Yu-Ming Chang, Che-Wei Tsao, Po-Chun Huang, Yuan-Hao
    Chang, and Tei-Wei Kuo. Garbage collection and wear leveling for flash
    memory: Past and future. In 2014 International Conference on Smart
    Computing, pages 66-73, 2014.
[22] Yuan Hua Yang, Xian Bin Xu, Shui Bing He, Fang Zhen, and Yu Ping Zhang.
    WLVT: A Static Wear-Leveling Algorithm with Variable Threshold. In
    Advanced Materials Research, volume 756, pages 3131-3135. Trans Tech
    Publications, 2013.
[23] Leonid Yavits, Lois Orosa, Suyash Mahar, João Dinis Ferreira, Mattan
    Erez, Ran Ginosar, and Onur Mutlu. WoLFRaM: Enhancing Wear-Leveling and
    Fault Tolerance in Resistive Memories using Programmable Address Decoders.
    In 2020 IEEE 38th International Conference on Computer Design (ICCD),
    pages 187-196, 2020.
[24] HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael A. Harding,
    and Onur Mutlu. Row buffer locality aware caching policies for hybrid
    memories. In 2012 IEEE 30th International Conference on Computer Design
    (ICCD), pages 337-344, 2012.
[25] Wangyuan Zhang and Tao Li. Characterizing and mitigating the impact of
    process variations on phase change based memory systems. In 2009 42nd
    Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),
    pages 2-13, 2009.
[26] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A Durable and Energy
    Efficient Main Memory Using Phase Change Memory Technology. SIGARCH
    Comput. Archit. News, 37(3):14-23, June 2009.
[27] Wen Zhou, Dan Feng, Yu Hua, Jingning Liu, Fangting Huang, and Pengfei
    Zuo. Increasing lifetime and security of phase-change memory with
    endurance variation. In 2016 IEEE 22nd International Conference on
    Parallel and Distributed Systems (ICPADS), pages 861-868. IEEE, 2016.