
2022 IEEE Information Theory Workshop (ITW)

Code-Aware Storage Channel Modeling via Machine Learning
Simeng Zheng and Paul H. Siegel
Electrical and Computer Engineering Dept., University of California, San Diego, La Jolla, CA 92093 U.S.A
{sizheng,psiegel}@ucsd.edu
Abstract—With the reduction in device size and the increase in cell bit-density, NAND flash memory suffers from larger inter-cell interference (ICI) and disturbance effects. Constrained coding can mitigate the ICI effects by avoiding problematic error-prone patterns, but designing powerful constrained codes requires a comprehensive understanding of the flash memory channel. Recently, we proposed a modeling approach using conditional generative networks to accurately capture the spatio-temporal characteristics of the read signals produced by arrays of flash memory cells under program/erase (P/E) cycling. In this paper, we introduce a novel machine learning framework for extending the generative modeling approach to the coded storage channel. To reduce the experimental overhead associated with collecting extensive measurements from constrained program/read data, we train the generative models by transferring knowledge from models pre-trained with pseudo-random data. This technique can accelerate the training process and improve model accuracy in reconstructing the read voltages induced by constrained input data throughout the flash memory lifetime. We analyze the quality of the model by comparing flash page bit error rates (BERs) derived from the generated and measured read voltage distributions. We envision that this machine learning framework will serve as a valuable tool in flash memory channel modeling to aid the design of stronger and more efficient coding schemes.

Index Terms—Data storage systems, Machine learning, Constrained coding, Inter-cell interference.

I. INTRODUCTION

Recently, there has been great interest in the application of machine learning to communications and networking, including data storage. For example, robust signal detection in magnetic recording channels using a recurrent neural network (RNN) architecture was demonstrated in [29]. A low-density parity-check (LDPC) decoder with flexible code lengths and column weights exploiting an RNN was proposed in [26]. Machine learning was also applied to page failure prediction [14], [22] and, in a limited setting, read voltage generation [15] in NAND flash memory. The synergy between machine learning and data storage is stimulating important mutual progress.

Realistic models for storage and communication channels are critical tools in the design of signal processing and coding methods. Generative flash modeling (GFM) [28] was recently proposed to model the complex spatio-temporal characteristics of read voltages in flash memory channels. Statistical models [7], [10], [16], [19] have been proposed to characterize the effects of P/E cycling, ICI, and retention on flash memory read voltages, but their predictions have not been validated in the literature by comparison with measurements of pattern-dependent errors as a function of spatial and temporal factors.

GFM uses a conditional VAE-GAN [9] architecture, combining a variational auto-encoder (VAE) [8] and a generative adversarial network (GAN) [3] in a conditional setting. This modeling approach was shown to comprehensively learn both spatial and temporal properties of the flash channel.

Constrained codes [17] have been proposed to mitigate read errors arising from the ICI phenomenon in flash memory by forbidding the programming of error-prone patterns. Learning the characteristics of the input-constrained flash channel poses a challenge, however. Statistical models have not been used to explore the subtle characteristics of the channel associated with the use of constrained data. GFM has the potential to model the input-constrained channel, but a model trained from pseudo-random data does not provide sufficient knowledge about the constrained channel. On the other hand, acquiring a large dataset of measurements from constrained data can consume excessive amounts of time and hardware resources.

It has been observed that knowledge learned by models pre-trained on a large dataset (e.g., ImageNet [2]) can effectively be applied to other tasks, either by extracting off-the-shelf features from trained networks [21], [27], or by adapting learned knowledge to a new domain [18]. Moreover, transfer learning has shown success in the context of generative models, e.g., in applications to image generation [25]. To accurately model the input-constrained flash channel, we therefore propose a transfer learning approach, whereby the GFM network is first trained on a large dataset of measurements from pseudo-random data, and then is fine-tuned by re-training on a much smaller dataset of measurements from constrained data. We refer to this as code-aware GFM.

The paper has the following contributions:
1) We propose a novel framework for code-aware generative channel modeling, in which the read voltage levels induced by coded program levels can be precisely and rapidly reconstructed.
2) We show how generative models trained on pseudo-random programming data can efficiently transfer knowledge to other coded-channel modeling tasks where code-specific data is limited.
3) We demonstrate the quality of reconstruction in code-aware GFM by analysis of voltage distributions and bit error rates (BERs).

II. NAND FLASH MEMORY AND ICI MITIGATION

A. NAND Flash Memory Basics

NAND flash memory stores data as voltages in floating-gate transistors, called cells. In a flash chip, cells are organized in two-dimensional (2-D) arrays, called blocks, consisting of horizontal wordlines (WLs) and vertical bitlines (BLs).


TABLE I
NUMERICAL VALUES OF PATTERN-DEPENDENT ERROR RATES FOR THE MOST SEVERE ICI PATTERNS
(Victim cell programmed to PL=0. "WL a-0-b": left/right WL neighbors at levels a and b; "BL a-0-b": BL neighbors above and below; "2-D t/(a-0-b)/u": WL pattern a-0-b combined with BL neighbors t above and u below.)

Error rate    none    WL 7-0-7  WL 7-0-6  WL 6-0-7  BL 7-0-7  BL 7-0-6  BL 6-0-7  2-D 7/(7-0-7)/7  2-D 7/(7-0-6)/7  2-D 7/(7-0-7)/6
4000 P/E      2.45%   10.97%    7.42%     7.36%     15.42%    10.76%    8.78%     48.06%           35.64%           36.59%
7000 P/E      4.06%   14.45%    10.35%    10.15%    20.48%    15.01%    12.81%    50.34%           42.15%           37.39%
10000 P/E     5.84%   18.42%    13.31%    13.46%    25.73%    19.35%    17.11%    54.22%           44.42%           44.67%

Multilevel flash memories store multiple bits per cell. For example, a triple-level cell (TLC) memory stores three bits using 2^3 = 8 possible voltage levels. Within each block, the three bits stored in cells along a WL are logically grouped into three pages, called the lower, middle, and upper page, respectively.

There are three basic operations on a flash device: program (write), read, and erase. We denote the program level as PL and the read voltage level as VL. Fig. 1 illustrates the conditional probability density functions (PDFs) of read voltages for the 8 program levels, each corresponding to a 3-bit string of lower, middle, and upper bits. The dash-dotted vertical lines represent the read thresholds used to recover the stored data. Level errors and bit errors occur when, for example, PL=0 induces a read voltage VL lying above the first threshold and below the second threshold, causing the level to be mistakenly detected as 1 and the upper bit to be mistakenly detected as 0.

Fig. 1. Voltage distributions and a recursive alternate Gray mapping (RAGM) between cell program levels and binary logic values of a TLC NAND flash memory. (The eight levels PL0-PL7 are labeled 111, 110, 100, 101, 001, 000, 010, 011.)

B. ICI Mitigation via Constrained Coding

ICI effects, caused by parasitic capacitive coupling between flash cells, are among the major obstacles to accurate programming and reading of a flash device [1]. Severe ICI arises when three consecutive cells in the WL or BL direction are programmed to high-low-high levels.

Error rates for the most severe ICI patterns in a commercial TLC flash device are shown in Table I. Using (i, j) to denote the (WL, BL) position of a cell in the block and Vth(01) to denote the threshold between PL=0 and PL=1, the table gives the overall level error rate for PL=0, and the error rates for worst-case WL, BL, and 2-D patterns, or, mathematically,

P(VL(i,j) > Vth(01) | PL(i,j) = 0);
P(VL(i,j) > Vth(01) | PL(i,j-1), PL(i,j) = 0, PL(i,j+1));
P(VL(i,j) > Vth(01) | PL(i-1,j), PL(i,j) = 0, PL(i+1,j));
P(VL(i,j) > Vth(01) | PL(i±1,j), PL(i,j) = 0, PL(i,j±1)).

We make two observations from Table I. First, ICI significantly increases error rates. At 4000 P/E cycles, the error rate of the 707 pattern in the WL (resp., BL) direction is a factor of 4.5 (resp., 6.3) larger than the average error rate. If we program 707 in both directions, the error rate is a factor of 19.6 larger than the average error rate. Second, P/E cycling causes error rates to increase. Specifically, the average error rate increases by a factor of 2.38 from 4000 P/E cycles to 10000 P/E cycles. For dominant 707 error patterns in WLs (resp., BLs), the error rate increases by a factor of 1.68 (resp., 1.67) from 4000 P/E cycles to 10000 P/E cycles.

Solid-state drives (SSDs) employ powerful error-correction codes (ECCs) [6] within their controllers to cope with such errors. Constrained codes to further reduce ICI-induced errors have been proposed and some have been experimentally validated [4], [5], [20], [24]. In particular, read-and-run (RR) constrained coding techniques [5] efficiently eliminate selected detrimental patterns by coding on only one page per WL. They allow random page access and are compatible with page-based ECCs. A generative model that accurately learns input-constrained channels will be a valuable tool in optimizing the combination of constrained coding and ECC.

III. CODE-AWARE STORAGE CHANNEL MODELING

The GFM scheme in [28] learns an approximation to the intractable likelihood P(VL | PL, P/E) from a dataset of measured voltage arrays VL produced by pseudo-random (unconstrained) program arrays PL. The goal of code-aware channel modeling is to infer the intractable likelihood P(VL_S | PL_S, P/E), where VL_S is the voltage array produced by the code-constrained program array PL_S. In this section, we describe our transfer learning approach to achieving this goal.

A. Review of Generative Flash Modeling

The conditional VAE-GAN architecture underlying the GFM scheme consists of three modules: encoder (Enc), generator (Gen), and discriminator (Dis).

During the training process, the encoder Enc produces latent vectors z from VL based on the VAE technique [8]. The generator Gen reconstructs an array of read voltages, VL~, based on PL, P/E, and z. The P/E vectors are concatenated with the output features of Gen for spatio-temporal combination. The discriminator Dis is trained to distinguish the real VL from the fake VL~.

After optimization, the learned Gen serves as a realistic flash channel simulator which accepts a program level array PL, a P/E cycle count, and a latent vector z as inputs. The latent vector is sampled from a standard multivariate Gaussian distribution. We express the reconstruction of VL in the training and evaluation processes, respectively, as

(Train)        VL~ = Gen(PL, P/E, Enc(VL)),
(Evaluation)   VL~ = Gen(PL, P/E, z).
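As a concrete reading of these two equations, the PyTorch-style sketch below mirrors the Enc/Gen/Dis data flow and the training/evaluation reconstruction modes. The layer sizes, latent dimension, and the injection of the P/E count at the generator input are illustrative assumptions, not the architecture of [28] (there, P/E is concatenated with intermediate Gen features).

```python
# Minimal sketch of the conditional VAE-GAN data flow in GFM (illustrative only;
# the actual network architectures and conditioning details are described in [28]).
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed latent-vector size

class Encoder(nn.Module):
    """Enc: maps a measured voltage array VL to a latent vector z (VAE-style)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
        self.mu = nn.Linear(256, LATENT_DIM)
        self.logvar = nn.Linear(256, LATENT_DIM)

    def forward(self, vl):
        h = self.net(vl)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization

class Generator(nn.Module):
    """Gen: reconstructs VL~ from the program-level array PL, the P/E count, and z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64 * 64 + 1 + LATENT_DIM, 256),
                                 nn.ReLU(), nn.Linear(256, 64 * 64))

    def forward(self, pl, pe, z):
        x = torch.cat([pl.flatten(1), pe, z], dim=1)  # concatenate conditioning inputs
        return self.net(x).view(-1, 1, 64, 64)

class Discriminator(nn.Module):
    """Dis: scores whether a voltage array looks real (measured) or fake (generated)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256),
                                 nn.ReLU(), nn.Linear(256, 1))

    def forward(self, vl):
        return self.net(vl)

enc, gen, dis = Encoder(), Generator(), Discriminator()
pl = torch.randint(0, 8, (2, 1, 64, 64)).float()   # program levels 0..7
vl = torch.randn(2, 1, 64, 64)                     # measured read voltages
pe = torch.full((2, 1), 7000.0)                    # P/E cycle count

vl_train = gen(pl, pe, enc(vl))                    # (Train)      VL~ = Gen(PL, P/E, Enc(VL))
vl_eval = gen(pl, pe, torch.randn(2, LATENT_DIM))  # (Evaluation) VL~ = Gen(PL, P/E, z), z ~ N(0, I)
real_score, fake_score = dis(vl), dis(vl_train)    # Dis separates real from fake arrays
```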


Full details about the training, evaluation, and experimental results are given in [28].

The GFM approach was demonstrated to accurately reconstruct cell voltage levels by capturing spatial ICI effects and temporal distortions from P/E cycling, as validated by comparing predicted time-dependent and pattern-dependent errors to error measurements.

B. Code-aware Generative Flash Modeling

The GFM framework is capable of learning the likelihood P(VL_S | PL_S, P/E) from a sufficiently large dataset {(PL_S, VL_S, P/E)} of code-constrained programming measurements at each P/E cycle. To avoid the expense of producing such a large dataset, we propose to use a transfer learning approach. We pre-train the GFM network on a large-scale source dataset {(PL, VL, P/E)} of VL measurements from pseudo-random (unconstrained) program arrays PL, then fine-tune it using a much smaller target dataset {(PL_S, VL_S, P/E)} of code-constrained measurements.

We now formulate the pipeline of code-aware GFM. As shown in Fig. 2, at the beginning of training, the three network modules in GFM (Enc, Gen, and Dis) are initialized with pre-trained weights learned from the source dataset. Using the target dataset, code-aware GFM then follows the GFM framework to complete the training process. After training, the network parameters in Gen constitute the simulator that produces voltage levels from code-constrained PL arrays.

Fig. 2. Pipeline of code-aware generative flash modeling. (The Enc, Gen, and Dis modules trained on the source dataset of random PL transfer their parameters to initialize training on the target dataset of constrained PL.)

We note that the relation between source and target datasets can impact the transfer learning results. In our case, because random programming arrays very likely include constrained sub-arrays, sharing the pre-trained network weights during the fine-tuning step enables the transfer of relevant knowledge.
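The pipeline of Fig. 2 can be summarized in the following sketch, which initializes all three modules from source-trained weights and then fine-tunes on the code-constrained target data. The module interfaces, the simplified reconstruction-only loss, and the single optimizer are illustrative placeholders; the actual VAE-GAN objective and alternating updates follow [28].

```python
# Sketch of the code-aware GFM pipeline (Fig. 2): initialize Enc, Gen, and Dis with
# weights pre-trained on the pseudo-random source dataset, then fine-tune on the much
# smaller code-constrained target dataset. `pretrained` holds the source-trained state
# dicts; the loss is a simplified stand-in for the full VAE-GAN losses of [28].
import torch

def fine_tune_code_aware_gfm(enc, gen, dis, pretrained, target_loader,
                             num_iterations=7500, lr=2e-4):
    # Transfer parameters of all three modules from the source-dataset (random PL) models.
    enc.load_state_dict(pretrained["enc"])
    gen.load_state_dict(pretrained["gen"])
    dis.load_state_dict(pretrained["dis"])   # Dis weights are transferred as well, even
                                              # though the adversarial updates are omitted here.

    params = list(enc.parameters()) + list(gen.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)   # learning rate 2e-4 (batch size 2 via the loader)

    it = 0
    while it < num_iterations:
        for pl, vl, pe in target_loader:          # code-constrained (PL_S, VL_S, P/E) triples
            vl_rec = gen(pl, pe, enc(vl))          # VL~ = Gen(PL, P/E, Enc(VL))
            # Reconstruction term only; the KL and adversarial terms of the full
            # objective, and the alternating Dis updates, are omitted for brevity.
            loss = (vl_rec - vl).abs().mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= num_iterations:
                break
    return gen   # Gen now serves as the simulator for code-constrained PL arrays
```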
C. Transfer Learning Configuration

It has been observed that pre-training all network modules can provide better results than pre-training a single module [25]. Therefore, in our transfer learning configuration, we share the parameters of all three modules of the pre-trained network.

In our experiments, we consider two read-and-run (RR) constrained codes [5]. The corresponding target datasets {(PL_S, VL_S, P/E)} consist of pairs of 64 x 64 PL and VL arrays, collected from a commercial TLC flash device at selected P/E cycles, as in the original GFM setup in [28]. This framework is also applicable to data shaping codes for flash memory [11]-[13]. These codes minimize the average cell wear due to programming by optimally "shaping" the probability distribution of the programmed cell levels.

The two constrained datasets are collected from a single commercial 1X-nm TLC chip belonging to the same family of chips used for the GFM experiments in [28]. Due to the variation of mappings between manufacturers and product generations, we describe the disallowed patterns of the code-constrained data in terms of the mapping in Fig. 1. The first target dataset uses a code constraint S_WL that forbids {000, 010} in the lower page of each WL. This eliminates error-prone patterns containing 707, 706, and 607 in the WL direction, as well as other high-low-high error-prone patterns. The second target dataset uses a code constraint S_2D that forbids {000, 010} in lower bits along both WL and BL directions. This eliminates error-prone patterns containing 707, 706, and 607 in both WLs and BLs, including all patterns shown in Table I, as well as other patterns.

We implement the WL-based constraint S_WL with an interleaved, rate 12:18 run-length-limited (RLL) (d, k) = (0, 1) code of overall block length 36 on the lower page, yielding an effective rate of 0.89 [5], [23]. The 2-D constraint is implemented with the 2-D RR scheme of [5], [23], which has an effective rate of 0.83.
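To illustrate how S_WL excludes the worst wordline patterns, the following sketch checks a lower-page bit sequence for the forbidden windows {000, 010}. The lower-bit mapping it assumes (bit 1 for PL 0-3, bit 0 for PL 4-7) is read off the labeling in Fig. 1 and would need adjusting for a different Gray mapping.

```python
# Sketch: verify the S_WL constraint (no 000 or 010 window in the lower page of a WL)
# and show that it rejects high-low-high WL patterns such as 7-0-7. The lower-bit
# mapping assumed here (1 for PL 0-3, 0 for PL 4-7) follows the labeling in Fig. 1.
def lower_bit(pl: int) -> int:
    return 1 if pl <= 3 else 0

def satisfies_swl(lower_page_bits) -> bool:
    """True iff no length-3 window of the lower page equals 000 or 010,
    i.e., no two cells spaced two apart both carry a lower bit of 0."""
    for i in range(len(lower_page_bits) - 2):
        window = (lower_page_bits[i], lower_page_bits[i + 1], lower_page_bits[i + 2])
        if window in {(0, 0, 0), (0, 1, 0)}:
            return False
    return True

# A 7-0-7 wordline pattern maps to lower bits 0, 1, 0 and is rejected:
assert not satisfies_swl([lower_bit(pl) for pl in [7, 0, 7]])
# A wordline with no high-low-high pattern passes:
assert satisfies_swl([lower_bit(pl) for pl in [7, 2, 3, 6, 1, 2]])
```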
We collect equal numbers of measured voltages at three P/E cycle counts: 4000, 7000, and 10000. The training and evaluation dataset sizes used in our modeling experiments, described in the next section, are shown in Table II. Note that the size of the target datasets is only 10% of the size of the source dataset.

TABLE II
SIZES OF TRAINING AND EVALUATION DATASETS

Dataset                        Training      Evaluation
|{(PL, VL, P/E)}|              1.5 × 10^5    2.1 × 10^4
|{(PL_SWL, VL_SWL, P/E)}|      1.5 × 10^4    1.5 × 10^4
|{(PL_S2D, VL_S2D, P/E)}|      1.5 × 10^4    1.5 × 10^4

Remark 1. In all transfer learning experiments, we use the same settings as were used to train the GFM, namely, batch size 2 and learning rate 2 × 10^-4. We settled upon these training parameters after several experiments.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we evaluate our code-aware GFM framework and present results of its application to the two RR constrained codes described in the previous section. We use two evaluation criteria, one to measure the accuracy of the reconstructed results, and the other to measure the training efficiency of the transfer learning procedure. The former is based on probability density functions (PDFs) of the reconstructed voltages, and the latter is based on the number of training iterations required to achieve accurate reconstruction. The evaluation metrics are defined in more detail below.

1) Probability density functions (PDFs): The read voltage PDFs are useful in optimizing read thresholds, gauging cell wear, and estimating bit error rates (BERs). For each P/E cycle, we estimate the conditional PDFs by the frequency of occurrence of measured voltage levels for each given program level. In addition to visually comparing the measured PDFs and reconstructed PDFs, we compute the total variation distance between the two PDFs and compare the associated bit error rates (BERs) on the lower, middle, and upper pages.


2) Training iterations: The number of training iterations needed to achieve satisfactory results can be used as a metric to evaluate the "speed" of the transfer learning process. A training iteration is defined as a single update of the model weights during training. For example, in [28], training the GFM network takes 7 epochs with a batch size of 2 using random programming arrays. With the training dataset size in Table II, the total number of training iterations is 5.25 × 10^5.

A. Experimental Settings

We conducted a matrix of experiments to evaluate the effectiveness of transfer learning in code-aware GFM, as summarized in Table III. (See the discussion below for an explanation of the abbreviations in the table.) The training iterations are shown in the last column of the table. For convenience, we use a shorthand notation to distinguish the experiments according to the training dataset ("T"), the network initialization ("I"), and the evaluation dataset ("E").

TABLE III
MODELING EXPERIMENTS AND TRAINING ITERATIONS

Initialization (I)   Training (T)   Evaluation (E)   Training Iterations
Random               PR             PR, WL, 2D       5.25 × 10^5
Random               WL             WL               6 × 10^4
Random               2D             2D               6.75 × 10^4
Pre-trained          WL             WL               7.5 × 10^3
Pre-trained          2D             2D               7.5 × 10^3

The training dataset corresponded to program arrays based on either pseudo-random data ("T-PR"), S_WL-constrained data ("T-WL"), or S_2D-constrained data ("T-2D"). Regarding the training mode, training started either from randomly initialized network weights ("I-Rnd") or pre-trained weights ("I-Pre") from T-PR training.

The evaluation mode examined reconstructed voltages generated by pseudo-random data ("E-PR"), S_WL-constrained data ("E-WL"), or S_2D-constrained data ("E-2D"). Comparisons are made to measurements ("M") from the TLC chip, derived from the pseudo-random dataset ("M-PR"), the S_WL-constrained dataset ("M-WL"), or the S_2D-constrained dataset ("M-2D").

We present results for the following experiments:
1) M-PR, M-WL, M-2D: These represent baseline experimental measurements from several 1X-nm flash blocks programmed with pseudo-random, WL-constrained, or 2D-constrained data.
2) I-Rnd / T-PR / E-(PR,WL,2D): We train GFM with random initial network weights using the pseudo-random training dataset and evaluate with pseudo-random data, WL-constrained data, and 2D-constrained data.
3) I-Pre / T-WL / E-WL: We initialize GFM with pre-trained weights from the previous training experiment (I-Rnd/T-PR), fine-tune the network using WL-constrained data, and evaluate the model with WL-constrained data.
4) I-Rnd / T-WL / E-WL: We train GFM with random initial network weights using the WL-constrained training dataset and evaluate with WL-constrained data.
5) I-Pre / T-2D / E-2D: We initialize GFM with pre-trained weights from the first training experiment (I-Rnd/T-PR), fine-tune the network using the 2D-constrained dataset, and evaluate the model with 2D-constrained data.
6) I-Rnd / T-2D / E-2D: We train GFM with random initial network weights using the 2D-constrained training dataset and evaluate with 2D-constrained data.

B. PDF Analysis

We now qualitatively and quantitatively analyze the reconstructed voltages from code-aware GFM. First, we visualize the PDFs of the measured and reconstructed read voltages. Fig. 3 shows the normalized conditional PDFs of the eight TLC program levels in the reconstructed data for experiment I-Pre/T-WL/E-WL at 7000 P/E cycles. (The plots of voltage PDFs for this experiment at 4000 and 10000 P/E cycles yield qualitatively similar results.) In this log-linear plot, the y-axis represents the probability density and the x-axis represents the read voltages using an arbitrary scale.

Fig. 3. PDF plots in logarithmic scale for measured and regenerated voltage levels (experiment I-Pre/T-WL/E-WL) at 7000 P/E cycles. The visualization is based on the dataset {(PL_SWL, VL_SWL, P/E)}.

Note that the S_WL code constraint on lower pages induces a smaller probability of occurrence for PLs 5, 6, and 7, approximately 1/3 of that of PLs 1, 2, 3, and 4. Qualitatively, the PDFs generated by code-aware GFM (solid curves) closely match the measured PDFs (triangle markers). Similarly, in experiment I-Pre/T-2D/E-2D, the visualization of the model-generated PDFs accurately reflects the measured PDFs and their dependence on P/E cycles.

Next, we evaluate the PDF results of the code-aware GFM experiments quantitatively using the total variation (TV) distance, d_TV. This distance provides a measure of the difference between the real (measured) distributions P_real and the fake (reconstructed) distributions P_fake:

d_TV(P_real, P_fake) = (1/2) Σ_VL |P_real(VL) - P_fake(VL)|.
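As a concrete reading of this metric, the sketch below estimates the conditional PDFs by frequency counting over quantized voltage bins and then evaluates d_TV between measured and reconstructed distributions. The bin count, voltage range, and per-program-level aggregation are illustrative assumptions rather than the paper's exact procedure (Table IV reports a single distance per experiment and P/E count).

```python
# Sketch: estimate per-program-level conditional PDFs from (PL, VL) arrays by frequency
# counting over quantized voltage bins, then compute the total variation distance between
# measured and reconstructed distributions. Bin count and shapes are illustrative.
import numpy as np

def conditional_pdfs(pl, vl, num_levels=8, bins=256, vmin=0.0, vmax=1.0):
    """Return an array of shape (num_levels, bins): P(VL bin | PL = level)."""
    edges = np.linspace(vmin, vmax, bins + 1)
    pdfs = np.zeros((num_levels, bins))
    for level in range(num_levels):
        counts, _ = np.histogram(vl[pl == level], bins=edges)
        total = counts.sum()
        if total > 0:
            pdfs[level] = counts / total
    return pdfs

def tv_distance(p_real, p_fake):
    """d_TV(P_real, P_fake) = (1/2) * sum over VL bins of |P_real - P_fake|."""
    return 0.5 * np.abs(p_real - p_fake).sum(axis=-1)

# Usage with measured voltages vl_meas and model reconstructions vl_gen for the
# same program-level array pl (e.g., a 64x64 block at one P/E cycle count):
pl = np.random.randint(0, 8, size=(64, 64))
vl_meas = np.random.rand(64, 64)
vl_gen = np.random.rand(64, 64)
d = tv_distance(conditional_pdfs(pl, vl_meas), conditional_pdfs(pl, vl_gen))
print(d)  # one TV distance per program level; aggregation over levels is an assumption
```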


The numerical results are shown in Table IV. We find that pre-training helps code-aware GFM produce distributions with the least TV distance in both the S_WL-coded and S_2D-coded scenarios.

TABLE IV
TOTAL VARIATION DISTANCE

P/E Cycle Count                       4000     7000     10000
d_TV(P_M-PR, P_I-Rnd/T-PR/E-PR)       0.0688   0.0650   0.0687
d_TV(P_M-WL, P_I-Pre/T-WL/E-WL)       0.0696   0.0535   0.0505
d_TV(P_M-WL, P_I-Rnd/T-WL/E-WL)       0.1421   0.1300   0.1020
d_TV(P_M-WL, P_I-Rnd/T-PR/E-WL)       0.1068   0.1116   0.1181
d_TV(P_M-2D, P_I-Pre/T-2D/E-2D)       0.1007   0.0771   0.0908
d_TV(P_M-2D, P_I-Rnd/T-2D/E-2D)       0.1175   0.1021   0.1408
d_TV(P_M-2D, P_I-Rnd/T-PR/E-2D)       0.1470   0.1330   0.1364

It is also important to consider the tails of the distributions, which have a major impact on the channel error rate. As discussed in Section II-A, cell level errors are determined by comparing the read voltages to the read thresholds, and the resulting bit errors on pages arise from the mapping between cell levels and their corresponding 3-bit binary logic values. We compared measured and reconstructed page bit error rates (BERs) from the ten experiments described in Section IV-A. The results are shown in Fig. 4.

Fig. 4. BER comparisons: the leftmost (resp., rightmost) three sub-figures show lower, middle, and upper page BERs for S_WL-coded (resp., S_2D-coded) data.
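To make that BER derivation explicit, the sketch below detects levels by comparing read voltages with the read thresholds, maps programmed and detected levels to their 3-bit (lower, middle, upper) logic values using the Fig. 1 labeling, and counts per-page bit disagreements. The threshold values and the toy noise model are placeholders, not calibrated device parameters.

```python
# Sketch: derive lower/middle/upper page BERs from read voltages. Levels are detected by
# comparing each voltage with the seven read thresholds, mapped to 3-bit logic values via
# the Fig. 1 labeling, and compared with the programmed bits. Thresholds are placeholders.
import numpy as np

# 3-bit logic values (lower, middle, upper) for PL0..PL7, per the mapping in Fig. 1.
LEVEL_TO_BITS = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 1],
                          [0, 0, 1], [0, 0, 0], [0, 1, 0], [0, 1, 1]])

def detect_levels(vl, thresholds):
    """Detected level = number of read thresholds the read voltage lies above."""
    return np.searchsorted(thresholds, vl.ravel())

def page_bers(pl, vl, thresholds):
    """Return (lower, middle, upper) page bit error rates for a programmed block."""
    programmed = LEVEL_TO_BITS[pl.ravel()]                    # bits actually written
    detected = LEVEL_TO_BITS[detect_levels(vl, thresholds)]   # bits read back
    return (programmed != detected).mean(axis=0)              # per-page error frequency

pl = np.random.randint(0, 8, size=(64, 64))          # programmed levels
vl = pl / 7.0 + 0.02 * np.random.randn(64, 64)       # noisy read voltages (toy model)
thresholds = (np.arange(1, 8) - 0.5) / 7.0           # seven placeholder read thresholds
lower_ber, middle_ber, upper_ber = page_bers(pl, vl, thresholds)
```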
The leftmost three sub-figures in Fig. 4 pertain to the lower, middle, and upper pages in the S_WL-coded case, respectively. The six curves in each plot correspond to the experimental measurements M-PR and M-WL, the GFM modeling experiments I-Rnd/T-PR/E-PR and I-Rnd/T-PR/E-WL, and the code-aware GFM experiments I-Rnd/T-WL/E-WL and I-Pre/T-WL/E-WL using the S_WL dataset, comparing training "from scratch" and with pre-trained network parameters.

The M-PR and M-WL curves show that S_WL coding decreases the measured BER on all three pages at all three measured P/E cycles, confirming the observations in [5]. The GFM experiment I-Rnd/T-PR/E-PR, using random initialization along with training and evaluation on pseudo-random data, reconstructs page BERs quite accurately at all three P/E cycles, a finding that is consistent with [28]. However, when this GFM network is evaluated using the S_WL-coded dataset in experiment I-Rnd/T-PR/E-WL, we see that the reconstructed BERs are significantly higher than the measured BERs in the M-WL curve at all three P/E cycles. This suggests that the pseudo-random dataset does not sufficiently capture all of the characteristics of the coded channel.

The final two curves compare the effects of random initialization and pre-training in the code-aware GFM networks obtained by training and evaluating on the S_WL dataset. We see that the two experiments yield very similar reconstructed BERs, with the exception of the lower page BER at 4000 P/E cycles, where random initialization yields a noticeably more inaccurate estimate. Overall, the reconstructed BERs qualitatively track the measured M-WL results reasonably well, although both models overestimate the BER in lower pages at all P/E cycles, as well as in middle pages at 4000 P/E cycles.

The rightmost three sub-figures in Fig. 4 show the corresponding BER results for the lower, middle, and upper pages in the S_2D-coded case, respectively. (The BERs of I-Rnd/T-PR/E-2D for lower and middle pages are at least 3 × 10^-2; thus, those curves are not shown in the sub-figures.) The overall conclusions drawn from these curves are similar to the S_WL-coded case, although we see that the GFM trained on the pseudo-random source dataset does an even worse job of learning the S_2D-coded channel.

C. Iteration Number Analysis

The number of training iterations used in the experiments was determined by comparing the reconstructed PDFs to the corresponding measured PDFs using the TV distance.

From Table III, we find that the number of iterations required to fine-tune the code-aware GFM network from the pre-trained model, 7.5 × 10^3, is only 12.5% (resp., 11.11%) of the number required when training from scratch using the target S_WL (resp., S_2D) dataset, namely 6 × 10^4 (resp., 6.75 × 10^4).

Specifically, when training from scratch using the smaller target dataset, we observed that in the early training iterations the reconstructed read voltage PDFs do not accurately capture temporal P/E cycle variations and tail behavior. On the other hand, adaptation from a single GFM network pre-trained with a sufficiently large source dataset of pseudo-random data provides enough channel knowledge to significantly accelerate the learning process from both of the smaller target datasets.

V. CONCLUSION

This paper presents an application of transfer learning to generative modeling of read voltages in flash memory channels. We fine-tune a generative model, pre-trained with a large source dataset of pseudo-random spatio-temporal data, using much smaller code-constrained target datasets. By comparing measured and reconstructed read voltage probability density functions and page bit error rates in a commercial TLC flash memory, we demonstrate that pre-training can accelerate learning for multiple generative modeling tasks even when the amount of target training data is very limited. These results motivate further investigation into the use of transfer learning in applications of machine learning to data storage and communication systems.


REFERENCES

[1] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, "Error characterization, mitigation, and recovery in flash-memory-based solid-state drives," Proc. IEEE, vol. 105, no. 9, pp. 1666-1704, Sep. 2017.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, "ImageNet: a large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Miami, FL, USA, June 2009.
[3] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Neural Inf. Process. Syst. (NIPS), Montréal, Canada, Dec. 2014, pp. 2672-2680.
[4] A. Hareedy, B. Dabak, and R. Calderbank, "Managing device lifecycle: reconfigurable constrained codes for M/T/Q/P-LC flash memories," IEEE Trans. Inf. Theory, vol. 67, no. 1, pp. 282-295, Oct. 2020.
[5] A. Hareedy, S. Zheng, P. H. Siegel, and R. Calderbank, "Read-and-run constrained coding for modern flash devices," in Proc. IEEE Int. Conf. Commun. (ICC), Seoul, South Korea, May 2022.
[6] P. Huang, Y. Liu, X. Zhang, P. H. Siegel, and E. F. Haratsch, "Syndrome-coupled rate-compatible error-correcting codes: theory and application," IEEE Trans. Inf. Theory, vol. 66, no. 4, pp. 2311-2330, Jan. 2020.
[7] Y. Kim and B. V. K. Vijaya Kumar, "Writing on dirty flash memory," in Proc. 52nd Annu. Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, Oct. 2014, pp. 513-520.
[8] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Represent. (ICLR), Banff, Canada, Apr. 2014.
[9] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," in Proc. Int. Conf. Mach. Learn. (ICML), New York, NY, USA, June 2016.
[10] Q. Li, A. Jiang, and E. F. Haratsch, "Noise modeling and capacity analysis for NAND flash memories," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Honolulu, HI, USA, June 2014, pp. 2262-2266.
[11] Y. Liu, P. Huang, A. W. Bergman, and P. H. Siegel, "Rate-constrained shaping codes for structured sources," IEEE Trans. Inf. Theory, vol. 66, no. 8, pp. 5261-5281, Aug. 2020.
[12] Y. Liu, Y. Li, P. Huang, and P. H. Siegel, "Rate-constrained shaping codes for finite-state channels with cost," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Espoo, Finland, Jun. 26-Jul. 1, 2022, to be published.
[13] Y. Liu and P. H. Siegel, "Shaping codes for structured data," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Washington, DC, USA, Dec. 2016.
[14] Y. Liu, S. Wu, and P. H. Siegel, "Bad page detector for NAND flash memory," in Non-Volatile Memories Workshop (NVMW), La Jolla, CA, USA, Mar. 2020.
[15] Z. Liu, Y. Liu, and P. H. Siegel, "Generative modeling of NAND flash memory voltage level," in Non-Volatile Memories Workshop (NVMW), La Jolla, CA, USA, Mar. 2021.
[16] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, "Enabling accurate and practical online flash channel modeling for modern MLC NAND flash memory," IEEE J. Select. Areas Commun., vol. 34, no. 9, pp. 2294-2311, Sept. 2016.
[17] B. H. Marcus, R. M. Roth, and P. H. Siegel, An Introduction to Coding for Constrained Systems, Oct. 2001. [Online]. Available: https://personal.math.ubc.ca/~marcus/Handbook/.
[18] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Columbus, OH, USA, June 2014.
[19] T. Parnell, N. Papandreou, T. Mittelholzer, and H. Pozidis, "Modelling of the threshold voltage distributions of sub-20nm NAND flash memory," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Austin, TX, USA, Dec. 2014, pp. 2351-2356.
[20] M. Qin, E. Yaakobi, and P. H. Siegel, "Constrained codes that mitigate inter-cell interference in read/write cycles for flash memories," IEEE J. Select. Areas Commun., vol. 32, no. 5, pp. 836-846, May 2014.
[21] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Columbus, OH, USA, June 2014.
[22] N. Sree Prem, "An application of machine learning to bad page prediction in multilevel flash," Master's thesis, University of California San Diego, 2019.
[23] P. H. Siegel, "Constrained codes for multilevel flash memory," presented at the North American School of Information Theory (Padovani Lecture), La Jolla, CA, USA, Aug. 12, 2015. [Online]. Available: http://cmrr-star.ucsd.edu/static/presentations/Padovani_Lecture_NASIT_Website.pdf. Video: https://www.youtube.com/watch?v=FCv2PJryUr4.
[24] V. Taranalli, H. Uchikawa, and P. H. Siegel, "Error analysis and inter-cell interference mitigation in multi-level cell flash memories," in Proc. IEEE Int. Conf. Commun. (ICC), London, U.K., June 2015, pp. 271-276.
[25] Y. Wang, C. Wu, L. Herranz, J. van de Weijer, A. Gonzalez-Garcia, and B. Raducanu, "Transferring GANs: generating images from limited data," in Proc. European Conf. Comput. Vis. (ECCV), Munich, Germany, Sep. 2018.
[26] X. Xiao, B. Vasić, R. Tandon, and S. Lin, "Designing finite alphabet iterative decoders of LDPC codes via recurrent quantized neural networks," IEEE Trans. Commun., vol. 68, no. 7, pp. 3963-3974, Apr. 2020.
[27] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. Neural Inf. Process. Syst. (NIPS), Montréal, Canada, Dec. 2014.
[28] S. Zheng, C.-H. Ho, W. Peng, and P. H. Siegel, "Spatio-temporal modeling for flash memory channels using conditional generative nets," May 2022, arXiv:2111.10039.
[29] S. Zheng, Y. Liu, and P. H. Siegel, "PR-NN: RNN-based detection for coded partial-response channels," IEEE J. Select. Areas Commun., vol. 39, no. 7, pp. 1967-1982, July 2021.
