Abstract—NAND flash memory is fast replacing traditional magnetic storage media due to its better performance and low power requirements. However, the endurance of flash memory is still a critical issue in using it for large scale enterprise applications. Rethinking the basic design of NAND flash memory is essential to realize its maximum potential in large scale storage. NAND flash memory is organized as blocks, and blocks in turn have pages. A block can be erased reliably only a limited number of times, and frequent block erase operations on a few blocks reduce the lifetime of the flash memory. Wear leveling helps to prevent the early wear out of blocks in the flash memory. In order to achieve efficient wear leveling, data is moved around throughout the flash memory. The existing wear leveling algorithms do not scale for large scale NAND flash based SSDs. In this paper we propose a static wear leveling algorithm, named Rejuvenator, for large scale NAND flash memory. Rejuvenator is adaptive to changes in workloads and minimizes the cost of expensive data migrations. Our evaluation of Rejuvenator is based on detailed simulations with large scale enterprise workloads and synthetic micro benchmarks.
I. INTRODUCTION

With recent technological trends, it is evident that NAND flash memory has enormous potential to overcome the shortcomings of conventional magnetic media. Flash memory has already become the primary non-volatile data storage medium for mobile devices, such as cell phones, digital cameras and sensor devices. Flash memory is popular among these devices due to its small size, light weight, low power consumption, high shock resistance and fast read performance [1], [2]. Recently, the popularity of flash memory has also extended from embedded devices to laptops, PCs and enterprise-class servers, with flash-based Solid State Disks (SSDs) widely being considered as a replacement for magnetic disks. Research works have been proposed to use NAND flash at different levels in the I/O hierarchy [3], [4]. However, NAND flash memory has inherent reliability issues, and it is essential to solve these basic issues to fully utilize its potential for large scale storage.

NAND flash memory is organized as an array of blocks. A block spans 32 to 64 pages, where a page is the smallest unit of read and write operations. NAND flash memory has two variants, namely SLC (Single Level Cell) and MLC (Multi Level Cell). SLC devices store one bit per cell while MLC devices store more than one bit per cell. Flash memory-based storage has several unique features that distinguish it from conventional disks. Some of them are listed below.

1) Uniform Read Access Latency: In conventional magnetic disks, the access time is dominated by the time required for the head to find the right track (seek time) followed by a rotational delay to find the right sector (rotational latency). As a result, the time to read a block of random data from a magnetic disk depends primarily on the physical location of that data. In contrast, flash memory does not have any mechanical parts, and hence flash memory-based storage provides uniformly fast random read access to all areas of the device independent of its address or physical location.
2) Asymmetric read and write accesses: In conventional magnetic disks, the read and write times to the same location on the disk are approximately the same. In flash memory-based storage, in contrast, writes are substantially slower than reads. Furthermore, all writes in a flash memory must be preceded by an erase operation, unless the writes are performed on a cleaned (previously erased) block. Read and write operations are done at the page level while erase operations are done at the block level. This leads to an asymmetry in the latencies of read and write operations.
3) Wear out of blocks: Frequent block erase operations reduce the lifetime of flash memory. Due to the physical characteristics of NAND flash memory, the number of times that a block can be reliably erased is limited. This is known as the wear out problem. For an SLC flash memory the number of times a block can be reliably erased is around 100K, and for an MLC flash memory it is around 10K [1].
4) Garbage Collection: Every page in flash memory is in one of three states: valid, invalid or clean. Valid pages contain data that is still valid. Invalid pages contain data that is dirty and no longer valid. Clean pages are those that are already in the erased state and can accommodate new data.
When the number of clean pages in the flash memory device is low, the process of garbage collection is triggered. Garbage collection reclaims the pages that are invalid by erasing them. Since erase operations can only be done at the block level, valid pages are copied elsewhere and then the block is erased. Garbage collection needs to be done efficiently because frequent erase operations during garbage collection can reduce the lifetime of blocks.
5) Write Amplification: In the case of hard disks, the user write requests match the actual physical writes to the device. However, in the case of SSDs, wear leveling and garbage collection activities cause the user data to be rewritten elsewhere without any actual write requests. This phenomenon is termed write amplification [5]. It is defined as follows (a brief numeric illustration is given after this list):

Write Amplification = Actual no. of page writes / No. of user page writes

6) Flash Translation Layer (FTL): Most recent high performance SSDs [6], [7] have a Flash Translation Layer (FTL) to manage the flash memory. The FTL hides the internal organization of NAND flash memory and presents a block device to the file system layer. The FTL maps the logical address space to the physical locations in the flash memory. The FTL is also responsible for wear leveling and garbage collection operations. Works have also been proposed [8] to replace the FTL with other mechanisms, with the file system taking care of the functionalities of the FTL.
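As a concrete illustration of the definition above, the short Python fragment below computes write amplification from the two counters an FTL could maintain; the numbers are hypothetical and are not taken from this paper's experiments.

def write_amplification(actual_page_writes, user_page_writes):
    # Ratio of physical page writes (user writes plus the copies made by
    # garbage collection and wear leveling) to the page writes the user requested.
    return actual_page_writes / user_page_writes

# Hypothetical counts: 1,000,000 user page writes that caused 250,000 extra
# page copies inside the device.
print(write_amplification(1_250_000, 1_000_000))   # 1.25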
In this paper, our focus is on the wear out problem. A wear leveling algorithm aims to even out the wearing of different blocks of the flash memory. A block is said to be worn out when it has been erased the maximum possible number of times. In this paper we define the lifetime of flash memory as the number of updates that can be executed before the first block is worn out. This is also called the first failure time [9]. The primary goal of any wear leveling algorithm is to increase the lifetime of flash memory by preventing any single block from reaching the 100K erasure cycle limit (we are assuming SLC flash). Our goal is to design an efficient wear leveling algorithm for flash memory.

The data that is updated more frequently is defined as hot data, while the data that is relatively unchanged is defined as cold data. Optimizing the placement of hot and cold data in the flash memory assumes utmost importance given the limited number of erase cycles of a flash block. If hot data is written repeatedly to certain blocks, then those blocks may wear out much faster than the blocks that store cold data. The existing approaches to wear leveling fall into two broad categories.

1) Dynamic wear leveling: These algorithms achieve wear leveling by repeatedly reusing blocks with lower erase counts. However, these algorithms do not attempt to move cold data that may remain forever in a few blocks. The blocks that store cold data wear out very slowly relative to other blocks. This results in a high degree of unevenness in the distribution of wear across the blocks.
2) Static wear leveling: In contrast to dynamic wear leveling algorithms, static wear leveling algorithms attempt to move cold data to more worn blocks, thereby facilitating a more even spread of wear. However, moving cold data around without any update requests incurs overhead.

Rejuvenator is a static wear leveling algorithm. It is important that the expensive work of migrating cold data during static wear leveling is done optimally and does not create excessive overhead. Our goal in this paper is to minimize this overhead and still achieve better wear leveling.

Most of the existing wear leveling algorithms have been designed for the use of flash memory in embedded devices or laptops. However, the application of flash memory in large scale SSDs as a full fledged storage medium for enterprise storage requires a rethinking of the design of flash memory right from the basic FTL components. With this motivation, we have designed a wear leveling algorithm that scales for large capacity flash memory and guarantees the required performance for enterprise storage.

By carefully examining the existing wear leveling algorithms, we have made the following observations. First, one important aspect of using flash memory is to take advantage of hot and cold data. If hot data is written repeatedly to a few blocks then those blocks may wear out sooner than the blocks that store cold data. Moreover, the need to increase the efficiency of garbage collection makes the placement of hot and cold data very crucial. Second, a natural way to balance the wearing of all data blocks is to store hot data in less worn blocks and cold data in the most worn blocks. Third, most of the existing algorithms focus too much on reducing the wearing difference of all blocks throughout the lifetime of the flash memory. This tends to generate additional migrations of cold data to the most worn blocks. The writes generated by this type of migration are considered an overhead and may reduce the lifetime of flash memory. While balancing the wear more often might be necessary for small scale embedded flash devices, it is not necessary for large scale flash memory where performance is more critical. In fact, a good wear leveling algorithm needs to balance the wearing level of all blocks aggressively only towards the end of the flash memory lifetime. This improves the performance of the flash memory. These are the basic principles behind the design and implementation of Rejuvenator. We named our wear leveling algorithm Rejuvenator because it prevents the blocks from reaching their lifetime faster and keeps them young.

Rejuvenator minimizes the number of stale cold data migrations and also spreads out the wear evenly by means of a fine grained management of blocks. Rejuvenator clusters the blocks into different groups based on their current erase counts. Rejuvenator places hot data in blocks in lower numbered clusters and cold data in blocks in the higher numbered clusters. The range of the clusters is restricted within a threshold value. This threshold value is adapted according to the erase counts of the blocks.
Our experimental results show that Rejuvenator outperforms the existing wear leveling algorithms.

The rest of the paper is organized as follows. Section II gives a brief overview of existing wear leveling algorithms. Section III explains Rejuvenator in detail. Section IV provides performance analysis and experimental results. Section V concludes the paper.

II. BACKGROUND AND RELATED WORK

As mentioned above, the existing wear leveling algorithms fall into two broad categories - static and dynamic. Dynamic wear leveling algorithms are used due to their simplicity in management. Blocks with lower erase counts are used to store hot data. L.P. Chang et al. [10] propose the use of an adaptive striping architecture for flash memory with multiple banks. Their wear leveling scheme allocates hot data to the banks that have the least erase count. However, as mentioned earlier, cold data remains in a few blocks and becomes stale. This contributes to a higher variance in the erase counts of the blocks. We do not discuss dynamic wear leveling algorithms further since they do a very poor job of leveling the wear.

The TrueFFS [11] wear leveling mechanism maps a virtual erase unit to a chain of physical erase units. When there are no free physical units left in the free pool, folding occurs, where the mapping of each virtual erase unit is changed from a chain of physical units to one physical unit. The valid data in the chain is copied to a single physical unit and the remaining physical units in the chain are freed. This guarantees a uniform distribution of erase counts for blocks storing dynamic data. Static wear leveling is done on a periodic basis and virtual units are folded in a round robin fashion. This mechanism is not adaptive and still has a high variance in erase counts depending on the frequency with which the static wear leveling is done. An alternative to the periodic static data migration is to swap the data in the most worn block and the least worn block [12]. JFFS [13] and STMicroelectronics [14] use very similar techniques for wear leveling.

Chang et al. [9] propose a static wear leveling algorithm in which a Bit Erase Table (BET) is maintained as an array of bits where each bit corresponds to 2^k contiguous blocks. Whenever a block is erased the corresponding bit is set. Static wear leveling is invoked when the ratio of the total erase count of all blocks to the total number of bits set in the BET is above a threshold. This algorithm may still lead to more than necessary cold data migrations depending on the number of blocks in the set of 2^k contiguous blocks. The choice of the value of k heavily influences the performance of the algorithm. If the value of k is small, the size of the BET is very large. However, if the value of k is higher, the expensive work of moving cold data is done more often than necessary.

The cleaning efficiency of a block is high if it has fewer valid pages. Agrawal et al. [15] propose a wear leveling algorithm which tries to balance the tradeoff between cleaning efficiency and the efficiency of wear leveling. The recycling of hot blocks is not completely stopped. Instead, the probability of restricting the recycling of a block is progressively increased as the erase count of the block nears the maximum erase count limit. Blocks with larger erase counts are recycled with lower probability. Thereby the wear leveling efficiency and cleaning efficiency are optimized. Static wear leveling is performed by storing cold data in the more worn blocks and making the least worn blocks available for new updates. The cold data migration adds 4.7% to the average I/O operational latency.

The dual pool algorithm proposed by L.P. Chang [16] maintains two pools of blocks - hot and cold. The blocks are initially assigned to the hot and cold pools randomly. Then, as updates are done, the pool associations become stable: blocks that store hot data are associated with the hot pool and blocks that store cold data are associated with the cold pool. If some block in the hot pool is erased beyond a certain threshold, its contents are swapped with those of the least worn block in the cold pool. The algorithm takes a long time for the pool associations of blocks to become stable. There could be a lot of data migrations before the blocks are correctly associated with the appropriate pools. Also, the dual pool algorithm does not explicitly consider cleaning efficiency. This can result in an increased number of valid pages to be copied from one block to another.

Besides wear leveling, other mechanisms like garbage collection and the mapping of logical to physical blocks also affect the performance and lifetime of the flash memory. Many works have been proposed for efficient garbage collection in flash memory [17], [18], [19]. The mapping of logical to physical memory can be at a fine granularity at the page level or at a coarse granularity at the block level. The mapping tables are generally maintained in RAM. The page level mapping technique consumes enormous memory since it contains mapping information about every page. Lee et al. [20] propose the use of a hybrid mapping scheme to get the performance benefits of page level mapping and the space efficiency of block level mapping. Lee et al. [21] and Kang et al. [22] also propose similar hybrid mapping schemes that utilize both page and block level mapping. All the hybrid mapping schemes use a set of log blocks to capture the updates and then write them to the corresponding data blocks. The log blocks are page mapped while data blocks are block mapped. Gupta et al. propose a demand based page level mapping scheme called DFTL [23]. DFTL caches a portion of the page mapping table in RAM and the rest of the page mapping table is stored in the flash memory itself. This reduces the memory requirements for the page mapping table.

III. REJUVENATOR ALGORITHM

In this section we describe the working of the Rejuvenator algorithm. The management operations for flash memory have to be carried out with minimum overhead. The design objective of Rejuvenator is to achieve wear leveling with minimized performance overhead and also to create opportunities for efficient garbage collection.
Fig. 1. Working of Rejuvenator algorithm

A. Overview

As with any wear leveling algorithm, the objective of Rejuvenator is to keep the variance in the erase counts of the blocks to a minimum so that no single block reaches its lifetime faster than others. Traditional wear leveling algorithms were designed for the use of flash memory in embedded systems and their main focus was to improve the lifetime. With the use of flash memory in large scale SSDs, wear leveling strategies have to be designed considering performance factors to a greater extent. Rejuvenator operates at a fine granularity and hence is able to achieve better management of flash blocks.

As mentioned before, Rejuvenator tries to map hot data to the least worn blocks and cold data to the more worn blocks. Unlike the dual pool algorithm and the other existing wear leveling algorithms, Rejuvenator explicitly identifies hot data and allocates it to appropriate blocks. The definition of hot and cold data is in terms of logical addresses. These logical addresses are mapped to physical addresses. We maintain a page level mapping for blocks storing hot data and a block level mapping for blocks storing cold data. The intuition behind this mapping scheme is that hot pages get updated frequently and hence their mapping is invalidated at a faster rate than that of cold pages. Moreover, in all of the workloads that we used, the number of pages that were actually hot is a very small fraction of the entire address space. Hence the memory overhead for maintaining the page level mapping for hot pages is very small. This idea is inspired by the hybrid mapping schemes that have already been proposed in the literature [20], [21], [22]. The hybrid FTLs typically maintain a block level mapping for the data blocks and a page level mapping for the update/log blocks.

The identification of hot and cold data is an integral part of Rejuvenator. We use a simple window based scheme with counters to determine which logical addresses are hot. The size of the window is fixed and it covers the logical addresses that were accessed in the recent past. At any point in time the logical addresses that have the highest counter values inside the window are considered hot. The hot data identification algorithm can be replaced by any of the sophisticated schemes that are already available [24], [25]. However, in this paper we stick to the simple scheme.

B. Basic Algorithm

Rejuvenator maintains τ lists of blocks. The difference between the maximum erase count of any block and the minimum erase count of any block is less than or equal to the threshold τ. Each block is associated with the list number equal to its erase count. Some lists may be empty. Initially all blocks are associated with list number 0. As blocks are updated they get promoted to the higher numbered lists. Let us denote the minimum erase count as min_wear and the maximum erase count as max_wear. Let the difference between max_wear and min_wear be denoted as diff. Every block can have three types of pages: valid pages, invalid pages and clean pages. Valid pages contain valid or live data. Invalid pages contain data that is no longer valid, i.e., dead data. Clean pages contain no data.

Let m be an intermediate value between min_wear and min_wear + (τ − 1). The blocks that have their erase counts between min_wear and min_wear + (m − 1) are used for storing hot data, and the blocks that belong to higher numbered lists are used to store cold data. This is the key idea behind which the algorithm operates. Algorithm 1 depicts the working of the proposed wear leveling technique. Algorithm 2 shows the static wear leveling mechanism. Algorithm 1 clearly tries to store hot data in blocks in the lists numbered min_wear to min_wear + (m − 1). These are the blocks that have been erased fewer times and hence have more endurance. From now on, we call list numbers min_wear to min_wear + (m − 1) the lower numbered lists and list numbers min_wear + m to min_wear + (τ − 1) the higher numbered lists.

As mentioned earlier, blocks in the lower numbered lists are page mapped and blocks in the higher numbered lists are block mapped. Consider the case where a single page in a block that has a block level mapping becomes hot. There are two options to handle this situation. The first option is to change the mapping of every page in the block to page level. The second option is to change the mapping for the hot page alone to page level and leave the rest of the block mapped at the block level. We adopt the latter method. This leaves the blocks fragmented, since physical pages corresponding to the hot pages still contain invalid data. We argue that this fragmentation is still acceptable since it avoids unnecessary page level mappings. In our experiments we found that the fragmentation was less than 0.001% of the entire flash memory capacity.
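The following self-contained Python sketch shows one way the per-erase-count lists and the hot/cold block selection described above could be organized. It is our illustration of the idea, not the authors' implementation; the class name, the parameters (num_blocks, tau, m) and the window check in erase() are assumptions made for the example.

from collections import defaultdict

class BlockLists:
    """Blocks grouped by erase count; lists min_wear .. min_wear+(m-1) serve
    hot data, lists min_wear+m .. min_wear+(tau-1) serve cold data."""

    def __init__(self, num_blocks, tau, m):
        self.tau, self.m = tau, m
        self.erase_count = [0] * num_blocks
        self.lists = defaultdict(list)
        self.lists[0] = list(range(num_blocks))   # initially every block is in list 0

    def min_wear(self):
        return min(c for c, blks in self.lists.items() if blks)

    def max_wear(self):
        return max(c for c, blks in self.lists.items() if blks)

    def pick_block(self, hot):
        """Return a candidate block: least worn lists for hot data, more worn for cold."""
        lo = self.min_wear()
        search = range(lo, lo + self.m) if hot else range(lo + self.m, lo + self.tau)
        for c in search:
            if self.lists[c]:
                return self.lists[c][0]
        return None   # caller would trigger the data migrations of Algorithm 2

    def erase(self, blk):
        """Erase a block and promote it to the next higher numbered list."""
        c = self.erase_count[blk]
        # Blocks in list min_wear+(tau-1) must not be erased, or the spread
        # max_wear - min_wear would grow beyond the threshold.
        assert c < self.min_wear() + self.tau - 1
        self.lists[c].remove(blk)
        self.erase_count[blk] = c + 1
        self.lists[c + 1].append(blk)

lists = BlockLists(num_blocks=1024, tau=16, m=8)
hot_block = lists.pick_block(hot=True)
lists.erase(hot_block)
print(hot_block, lists.min_wear(), lists.max_wear())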
Algorithm 1 explains the steps carried out when a write request to an LBA arrives. Consider an update to an LBA. If the LBA already has a physical mapping, let e be the erase count of the block corresponding to the LBA. When a hot page in the lower numbered lists is updated, a new page from a block belonging to the lower numbered lists is used. This is done to retain the hot data in the blocks in the lower numbered lists. When the update is to a page in the lower numbered lists and it is identified as cold, we check for a block mapping for that LBA. If there is an existing block mapping for the LBA, then, since the LBA already had a page mapping, the corresponding page in the mapped physical block will be free or invalid. The data is written to the corresponding page in the mapped physical block (if the physical page is free) or to a log block (if the physical page is marked invalid and not free). If there is no block mapping associated with the LBA, it is written to one of the clean blocks belonging to the higher numbered lists, so that the cold data is placed in one of the more worn blocks.

Algorithm 1 Working of Rejuvenator
  Event = Write request to LBA
  if LBA has a pagemap then
    if LBA is hot then
      Write to a page in lower numbered lists
      Update pagemap
    else
      Write to a page in higher numbered lists (or to log block)
      Update blockmap
    end if
  else if LBA is hot then
    Write to a page in lower numbered lists
    Invalidate data in any associated blockmap
    Update pagemap
  else if LBA is cold then
    Write to a page in higher numbered lists (or to log block)
    Update blockmap
  end if
Similarly, when a page in a block belonging to the higher numbered lists is updated and it contains cold data, it is stored in a new block from the higher numbered lists. Since these blocks are block mapped, the updates need to be done in log blocks. To achieve this, we follow the scheme adopted in [26]. A log block can be associated with any data block. Any updates to the data block go to the log block. The data blocks and the log block are merged during garbage collection. This scheme is called Fully Associative Sector Translation [26]. Note that this scheme is used only for data blocks storing cold data that receive very few updates. Thus the number of log blocks required is small. One potential drawback of this scheme is that, since log blocks contain cold data, most of their pages remain valid. So during garbage collection there may be many expensive full merge operations, where the valid pages from the log block and the data block associated with the log block need to be copied to a new clean block and then the data block and log block are erased. However, in our garbage collection scheme, as explained later, the higher numbered lists are garbage collected only after the lower numbered lists. Hence the frequency of these full merge operations is very low. Even otherwise, these full merges are unavoidable tradeoffs with block level mapping. When the update is to a page in the higher numbered lists and the page is identified as hot, we simply invalidate the page and map it to a new page in the lower numbered lists. The block association of the current block to which the page belongs is unaltered. As explained before, this is to avoid remapping other pages in the block that are cold.

C. Garbage Collection

Garbage collection is done starting from blocks in the lowest numbered list and then moving to higher numbered lists. The reasons behind this are twofold. The first reason is that, since blocks in the lower numbered lists store hot data, they tend to have more invalid pages. We define the cleaning efficiency of a block as follows:

Cleaning Efficiency = No. of invalid and clean pages / Total no. of pages in the block

If the cleaning efficiency of a block is high, fewer pages need to be copied before erasing the block. Intuitively the blocks in the lower numbered lists have a higher cleaning efficiency since they store hot data. The second reason for garbage collecting from the lower numbered lists is that the blocks in these lists have lower erase counts. Since garbage collection involves erase operations, it is always better to garbage collect blocks with lower erase counts first.
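A minimal sketch of this victim selection order follows, under a simplified block representation of our own (a block is just a dictionary of page-state counts, and the 0.5 efficiency cutoff is an assumed tuning value, not one taken from the paper).

def cleaning_efficiency(block):
    # Fraction of pages that need no copying before the erase.
    return (block["invalid"] + block["clean"]) / block["total"]

def pick_gc_victim(lists_by_erase_count, min_wear, tau, min_efficiency=0.5):
    # Scan from the lowest numbered (least worn) list upwards, as described above.
    for count in range(min_wear, min_wear + tau):
        candidates = lists_by_erase_count.get(count, [])
        good = [b for b in candidates if cleaning_efficiency(b) >= min_efficiency]
        if good:
            return max(good, key=cleaning_efficiency)   # cheapest block to reclaim
    return None

# Hypothetical lists: a hot block in list 3 with mostly invalid pages, and a
# colder block in list 7 that would force many valid-page copies.
lists = {3: [{"invalid": 50, "clean": 4, "total": 64}],
         7: [{"invalid": 10, "clean": 0, "total": 64}]}
print(pick_gc_victim(lists, min_wear=3, tau=16))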
Algorithm 2 Data Migrations
  if No. of clean blocks in lower numbered lists < T_L then
    Migrate data from blocks in list number min_wear to blocks in higher numbered lists
    Garbage collect blocks in list numbers min_wear and min_wear + (τ − 1)
  end if
  if No. of clean blocks in higher numbered lists < T_H then
    Migrate data from blocks in list number min_wear to blocks in lower numbered lists
    Garbage collect blocks in list numbers min_wear and min_wear + (τ − 1)
  end if

D. Static Wear Leveling

Static wear leveling moves cold data from blocks with low erase counts to blocks with higher erase counts. This frees up the least worn blocks, which can then be used to store hot data. It also spreads the wearing of blocks evenly. Rejuvenator does this in a well controlled manner and only when necessary. Cold data migration is generally done by swapping the cold data of a block with a low erase count with the data of another block with a high erase count [16], [11]. In Rejuvenator this is done more systematically.

The operation of the Rejuvenator algorithm can be visualized as a moving window whose size is τ, as in Figure 1. As the value of min_wear increases by 1, the window slides down and thus allows the value of max_wear to increase by 1. As the window moves, its movement can be restricted at both ends - upper and lower. The blocks in list number min_wear + (τ − 1) can be used for new writes but cannot be erased, since the window size would then increase beyond τ.
The window movement is restricted at the lower end when the value of min_wear either does not increase any further or increases very slowly. This is due to the accumulation of cold data in the blocks in the lower numbered lists. In other words, the cold data has become stale/static in the blocks in the lower numbered lists. This condition is detected when the number of clean blocks in the lower numbered lists falls below a threshold. This is taken as an indication that cold data is remaining stale in the blocks in list number min_wear, and so that data is moved to blocks in the higher numbered lists. The blocks in list number min_wear are then cleaned. This makes these blocks available for storing hot data and at the same time increases the value of min_wear by 1. This also makes room for garbage collecting in list number min_wear + (τ − 1) and hence makes more clean blocks available for cold data as well.

The movement of the window can also be restricted at the higher end. This happens when there are a lot of invalid blocks in the max_wear list and they are not garbage collected. If no clean blocks are found in the higher numbered lists, it is an indication that there are invalid blocks in list number min_wear + (τ − 1) that cannot be garbage collected, since the value of diff would exceed the threshold. This condition happens when the number of blocks storing cold data is insufficient. In order to enable smooth movement of the window, the value of min_wear has to increase by 1. The blocks in list min_wear may still have hot data, since the movement of the window is restricted at the higher end only. Hence the data in all these blocks is moved to blocks in the lower numbered lists themselves. However, this condition does not happen frequently, since before it is triggered the blocks storing hot data are updated faster and the value of min_wear increases by 1. Rejuvenator takes care of the fact that some data which is hot may turn cold at some point of time and vice versa. If data that is cold is turning hot, it is immediately moved to one of the blocks in the lower numbered lists. Similarly, cold data is moved to more worn blocks by the algorithm. Hence the performance of the algorithm is not seriously affected by the accuracy of the hot-cold data identification mechanism. As the window has to keep moving, data is migrated to and from blocks according to its degree of hotness. This migration is done only when necessary rather than by forcing the movement of stale cold data. Hence the performance overhead of these data migrations is minimized.
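The two restrictions and the resulting migrations can be summarized in code. The sketch below is schematic only: the thresholds T_L and T_H and the helper methods on the fm object (clean_blocks_in, migrate, garbage_collect) are hypothetical placeholders for FTL internals, and the adjustment of m anticipates Section III-F.

def maybe_slide_window(fm, T_L=8, T_H=8):
    # Schematic rendering of the triggers behind Algorithm 2 (not the authors' code).
    lower = range(fm.min_wear, fm.min_wear + fm.m)              # hot-data lists
    higher = range(fm.min_wear + fm.m, fm.min_wear + fm.tau)    # cold-data lists

    if fm.clean_blocks_in(lower) < T_L:
        # Stale cold data is pinning list min_wear: push it to more worn blocks
        # so the list can be cleaned and min_wear (the window) can advance.
        fm.migrate(src_list=fm.min_wear, dst_lists=higher)
        fm.garbage_collect(fm.min_wear)
        fm.garbage_collect(fm.min_wear + fm.tau - 1)
        fm.m += 1        # give hot data a larger share of the window

    if fm.clean_blocks_in(higher) < T_H:
        # Too few clean blocks for cold data: move data out of list min_wear
        # into the lower numbered lists so the window can still slide.
        fm.migrate(src_list=fm.min_wear, dst_lists=lower)
        fm.garbage_collect(fm.min_wear)
        fm.garbage_collect(fm.min_wear + fm.tau - 1)
        fm.m -= 1        # give cold data a larger share of the window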
E. Adapting the parameter τ

The key aspect of Rejuvenator is that the parameter τ is adjusted according to the lifetime of the blocks. We argue that this parameter value can be large at the beginning, when the blocks are much farther away from reaching their lifetime. However, as the blocks approach their lifetime the value of τ has to decrease. Towards the end of the lifetime of the flash memory, the value of τ has to be very small. To achieve this goal, we adopt two methods for decreasing the value of τ.

1) Linear Decrease: Let the difference between 100K (the maximum number of erases that a block can endure) and max_wear (the maximum erase count of any block in the flash memory) be life_diff. As the blocks are being used up, the value of τ is r% of life_diff. For our experimental purposes we set the value of r to 10%. As the value of max_wear increases, the value of life_diff decreases linearly and so does the value of τ. Figure 2 illustrates the decreasing trend of the value of τ in the linear scheme.

2) Non-Linear Decrease: The linear decrease uniformly reduces the value of τ by r% every time a decrease is triggered. If a still more efficient control is needed, the value of τ should instead be decreased in a non-linear manner, i.e., the decrease in τ has to be slower in the beginning and get steeper towards the end. Figure 3 illustrates our scheme. We choose a curve as in Figure 3 and set the value of τ to the slope of the curve corresponding to the value of life_diff. We can see that the rate of decrease in τ is much steeper towards the end of the lifetime.

Fig. 2. Linear decrease of τ
Fig. 3. Non-linear decrease of τ
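The two schedules can be sketched as follows. The linear rule follows the description above with r = 10%; the non-linear rule uses a quadratic curve of our own choosing, since the paper specifies only the qualitative shape of the curve in Figure 3 (slow decrease early, steep decrease near the end of life); MAX_ERASES assumes SLC flash.

MAX_ERASES = 100_000                      # SLC erase cycle limit assumed above

def tau_linear(max_wear, r=0.10):
    # tau is r% of life_diff, so it shrinks linearly as max_wear grows.
    life_diff = MAX_ERASES - max_wear
    return max(1, int(r * life_diff))

def tau_nonlinear(max_wear, r=0.10, p=2):
    # One possible convex schedule: stays close to the initial value early on
    # and drops sharply as max_wear approaches MAX_ERASES.
    used = max_wear / MAX_ERASES
    return max(1, int(r * MAX_ERASES * (1 - used ** p)))

for w in (0, 50_000, 90_000, 99_000):
    print(w, tau_linear(w), tau_nonlinear(w))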
F. Adapting the parameter m

The value of m determines the ratio of blocks storing hot data to blocks storing cold data. Initially the value of m is set to 50% of τ, and then, according to the workload pattern, the value of m is incremented or decremented. Whenever the window movement is restricted at the lower end, the value of m is incremented by 1 following the stale cold data migrations. This makes more blocks available to store hot data. Similarly, whenever the window movement is restricted at the higher end, the value of m is decremented by 1 so that there are more blocks available for cold data. This adjustment of m helps to further reduce the data migrations. Whenever the value of m is incremented or decremented, the type of mapping (block level or page level) of the blocks in list number min_wear + (m − 1) is not changed immediately. The mapping is changed to the relevant type only for write requests arriving after the increment or decrement. This causes a few blocks in the lower numbered lists to be block mapped, but this is taken care of during the static wear leveling and garbage collection operations.

IV. EVALUATION

This section discusses the overheads involved in the implementation of Rejuvenator analytically and evaluates the performance of Rejuvenator via detailed experiments.

A. Analysis of overheads

The most significant overhead of Rejuvenator is the management of the lists of blocks. This overhead could manifest in terms of both space and performance. However, our implementation tries to minimize these overheads.

First we analyze the memory requirements of Rejuvenator. The number of lists is at most τ. Each list contains blocks with erase counts equal to the list number. We implemented the lists as dynamic vectors numbered from 0 to τ. Free blocks are always added at the front of a vector and blocks containing data are added at the back. Assuming that each block address occupies 8 bytes of memory, a 32 GB flash memory with 4 KB pages and 128 KB blocks would require 2 MB of additional memory. Since these lists are maintained based on erase counts, the logical to physical address mapping tables have to be maintained separately. Rejuvenator maintains both block level and page level mapping tables. A pure page level mapping table for the same 32 GB flash would require 64 MB of memory. However, since Rejuvenator maintains page maps only for hot LBAs and the proportion of hot LBAs is much smaller (< 10%), the memory requirement is much smaller. For the above mentioned 32 GB flash, the memory occupied by the mapping tables does not exceed 3 MB. Page level mappings are also maintained for the log blocks. However, the log blocks occupy a very small portion of the entire flash memory (< 3% [21]) and hence their memory requirement is insignificant.
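As a back-of-the-envelope check of these figures (our arithmetic, assuming the Table I geometry of 4 KB pages and 128 KB blocks and 8-byte mapping entries):

CAPACITY = 32 * 2**30                      # 32 GB
PAGE_SIZE, BLOCK_SIZE, ENTRY = 4 * 2**10, 128 * 2**10, 8

num_blocks = CAPACITY // BLOCK_SIZE        # 262,144 blocks
num_pages = CAPACITY // PAGE_SIZE          # 8,388,608 pages

list_entries_mb = num_blocks * ENTRY / 2**20    # per-block entries in the erase-count lists
full_page_map_mb = num_pages * ENTRY / 2**20    # a page map covering every page
block_map_mb = num_blocks * ENTRY / 2**20       # a block level map for cold data

print(list_entries_mb, full_page_map_mb, block_map_mb)   # 2.0 MB, 64.0 MB, 2.0 MB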
Next we discuss the performance overheads of Rejuvenator. The association of blocks with the appropriate lists and the block lookups in the lists are the additional operations in Rejuvenator. The association of blocks with the lists is done during garbage collection. As soon as a block is erased, it is removed from its current list and associated with the next higher numbered list. Since garbage collection is done list by list starting from the lower numbered lists, and all the blocks containing data are at the back of the lists, this operation takes O(1) time. The block lookups are done in the mapping tables. Since the hot pages are page mapped, the efficiency of writes is improved, because there are no block copy operations of the kind typically involved with block level mapping. For cold writes, the updates are buffered in the log blocks and are merged with the data blocks later during garbage collection. The log blocks typically occupy 3% [21] of the entire flash region. This is to buffer writes to the entire flash region. However, in Rejuvenator the log blocks buffer writes only to the blocks storing cold data, so the log buffer region can be much smaller. In our experiments we did not exclusively define a log block region. We pick a free block with the least possible erase count in the higher numbered lists and use it as a log block.
Hot data identification is an integral part of Rejuvenator. Rejuvenator maintains an LRU window of fixed size (W) with the LBAs and corresponding counters for the number of accesses. Every time the window is full, the LBA in the LRU position is evicted and the new LBA is accommodated in the MRU position. The most frequently accessed LBAs in the window are considered hot and are page mapped. Instead of sorting the LBAs based on frequency count, we maintain the average access count of the window, and any LBA that has an access count greater than the average is considered hot. The hot data identification algorithm thus accounts for both recency and frequency of accesses of the LBAs. Every time the window is full, the counters are divided by 2 to prevent any single LBA from inflating the average.
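A compact Python sketch of this window based identifier follows; it is our simplified reading of the scheme (the window size W, the demo access pattern and the point at which counters are halved are illustrative choices, not values from the paper).

from collections import OrderedDict

class HotDataWindow:
    def __init__(self, W=1024):
        self.W = W
        self.window = OrderedDict()            # LBA -> access count, kept in LRU order

    def access(self, lba):
        self.window[lba] = self.window.pop(lba, 0) + 1     # move to MRU, bump counter
        if len(self.window) > self.W:
            self.window.popitem(last=False)                # evict the LRU entry
            for k in self.window:                          # halve counters so no single
                self.window[k] //= 2                       # LBA keeps inflating the average

    def is_hot(self, lba):
        if lba not in self.window:
            return False
        average = sum(self.window.values()) / len(self.window)
        return self.window[lba] > average

w = HotDataWindow(W=8)
for lba in [1, 1, 1, 2, 3, 1, 4, 2, 1]:
    w.access(lba)
print(w.is_hot(1), w.is_hot(3))                            # frequently updated LBA 1 is hot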
B. Experiments

This section explains in detail our experimental setup and the results of our simulation. We compare Rejuvenator with two other wear leveling algorithms - the dual pool algorithm [16] and the wear leveling algorithm adopted by M-Systems in the True Flash Filing System (TrueFFS) [11]. While TrueFFS is an industry standard, its emphasis on static wear leveling is much less. On the other hand, the dual pool algorithm is a well known wear leveling algorithm in the area of flash memory research and primarily aims at achieving good static wear leveling. We believe that all other wear leveling algorithms either do not attempt a fine grained management of the blocks or adopt a slight variation of these two schemes, and hence are not suitable candidates for comparison with Rejuvenator.

TABLE I
FLASH MEMORY CHARACTERISTICS

Page Size | Block Size | Read Time | Write Time | Erase Time
4 KB      | 128 KB     | 25 μs     | 200 μs     | 1.5 ms
1) Simulation Environment: The simulator that we used is trace driven and provides a modular framework for simulating flash based storage systems. The simulator that we have built is intended exclusively for studying the internal characteristics of flash memory in detail. The various modules of a flash memory design, such as the FTL design (currently integrated with Rejuvenator), garbage collection and hot data identification, can be independently deployed and evaluated. We simulated a 32 GB NAND flash memory with the specifications given in Table I. However, we restrict the active region of accesses to which the reads and writes are done, so that the performance of wear leveling can be observed in close detail. The remaining blocks do not participate in the I/O operations. The same method has been adopted in [23].
An alternate way to demonstrate the performance of a wear leveling scheme is the one followed in [15]. The authors consider the entire flash memory for reads and writes but assume that the maximum lifetime of every block is only 50 erase cycles. However, this technique may not give an exact picture of the performance of Rejuvenator, because with a larger erase count limit the system can have much more relaxed constraints. The main objective of Rejuvenator is to reduce the migrations of data due to tight constraints on the erase counts of blocks. We have adopted both of these techniques to evaluate the performance of Rejuvenator. We consider a portion of the SSD as the active region and set the maximum erase count limit for the blocks to 2K. This way the impact of Rejuvenator on the lifetime and performance of the flash memory can be studied in detail.

2) Workloads: We evaluated Rejuvenator with three available enterprise-scale traces and two synthetic traces. The first trace is a write intensive I/O trace provided by the Storage Performance Council [27] called the Financial trace. It was collected from an OLTP application hosted at a financial institution. The second trace is a more recent trace that was collected from a Microsoft Exchange Server serving 5000 mail users at Microsoft [28]. The third trace is the Cello99 trace from HP Labs [29]. This trace was collected over a period of one year from the Cello server at HP Labs. We replayed the traces until a block reached its lifetime. Even though the traces are replayed, the behavior of the system is completely different for two different runs of the same trace since the blocks are becoming older.

We also generated two synthetic traces. The access pattern of the first trace consisted of a random distribution of blocks and the second trace had 50% sequential writes. All the write requests are 4 KB in size.
3) Performance Analysis: The typical performance metric for a wear leveling algorithm is the number of write requests that are serviced before a single block reaches its maximum erase count. We call this the lifetime of the flash. Another metric that is typically used to evaluate the performance of wear leveling is the additional overhead that is incurred due to data migrations. These are the erase and copy operations that are done without any write requests.

To make a fair comparison we set the value of the threshold for dual pool to 16. Dual pool uses a block level mapping scheme for all the blocks. We used the Fully Associative Sector Translation [26] in dual pool for the block level mapping. In TrueFFS a virtual erase unit consists of a chain of physical erase units. During garbage collection these physical erase units are folded into one physical erase unit. We assume that these physical erase units are in units of blocks (128 KB) and that the reads and writes are done at the level of pages. Hence TrueFFS also employs a block level address mapping.

Figure 4 shows the number of write requests that are serviced before a single block reaches its lifetime. Rejuvenator (Linear) means that the value of τ is decremented linearly and Rejuvenator (Non-Linear) is the scheme where the value of τ is decremented non-linearly. On average, Rejuvenator increases the lifetime of blocks by 20% compared to the dual pool algorithm across all traces. The dual pool algorithm performs much worse than Rejuvenator for the Exchange trace and Trace A. This is simply because the dual pool algorithm could not adapt to the rapidly changing workload patterns. Since all the blocks have a block level mapping, random page writes in these traces lead to too many erase operations. The TrueFFS algorithm, on the other hand, consistently performs badly since some of the blocks reach very high erase counts much faster than other blocks.

Fig. 4. Number of write requests serviced before a single block reaches its lifetime
Fig. 5. Overhead caused by extra block erases during wear leveling (normalized to Rejuvenator (non-linear))
Fig. 6. Overhead caused by extra block copy operations during wear leveling (normalized to Rejuvenator (non-linear))
Fig. 7. Distribution of erase counts in the blocks