Enabling Training of Neural Networks on Noisy Hardware

Tayfun Gokmen
Deep neural networks (DNNs) are typically trained using the conventional stochastic gradient descent (SGD) algorithm. However, SGD performs poorly when applied to train networks on non-ideal analog hardware composed of resistive device arrays with non-symmetric conductance modulation characteristics. Recently we proposed a new algorithm, the Tiki-Taka algorithm, that overcomes this stringent symmetry requirement. Here we build on top of Tiki-Taka and describe a more robust algorithm that further relaxes other stringent hardware requirements. This more robust second version of the Tiki-Taka algorithm (referred to as TTv2) 1) reduces the required number of device conductance states from 1,000s of states to only 10s of states, 2) increases the tolerance to noise in the device conductance modulations by about 100×, and 3) increases the tolerance to noise in the matrix-vector multiplications performed by the analog arrays by about 10×. Empirical simulation results show that TTv2 can train various neural networks close to their ideal accuracy even at extremely noisy hardware settings. TTv2 achieves these capabilities by complementing the original Tiki-Taka algorithm with lightweight, low-computational-complexity digital filtering operations performed outside the analog arrays. Therefore, the implementation cost of TTv2 compared to SGD and Tiki-Taka is minimal, and it maintains the usual power and speed benefits of using analog hardware for training workloads. Here we also show how to extract the neural network from the analog hardware once the training is complete for further model deployment. Similar to Bayesian model averaging, we form analog-hardware-compatible averages over the neural network weights derived from TTv2 iterates. This model average can then be transferred to another analog or digital hardware with notable improvements in test accuracy, transcending the trained model itself. In short, we describe an end-to-end training and model extraction technique for extremely noisy crossbar-based analog hardware that can be used to accelerate DNN training workloads and match the performance of full-precision SGD.

Keywords: learning algorithms, training algorithms, neural network acceleration, Bayesian neural network, in-memory computing, on-chip learning, crossbar arrays, memristor
INTRODUCTION
Deep neural networks (DNNs) (LeCun et al., 2015) have achieved tremendous success in multiple domains, outperforming other approaches and even humans (He et al., 2015) at many problems: object recognition, video analysis, and natural language processing are only a few to mention. However, this success was enabled mainly by scaling the DNNs and datasets to extreme sizes, and therefore it came at the expense of needing immense computation power and time. For instance, the amount of compute required to train a single GPT-3 model composed of 175B parameters is tremendous: 3,600 Petaflops/s-days (Brown et al., 2020), equivalent to running 1,000 state-of-the-art NVIDIA A100 GPUs, each delivering 150 Teraflops/s of performance, for about 24 days. Hence, today's and tomorrow's large models are costly to train both financially and environmentally on currently available hardware (Strubell et al., 2019), begging for faster and more energy-efficient solutions.

DNNs are typically trained using the conventional stochastic gradient descent (SGD) and backpropagation (BP) algorithm (Rumelhart et al., 1986). During DNN training, matrix-matrix multiplications, and hence repeated multiply and add operations, dominate the total workload. Therefore, regardless of the underlying technology, realizing highly optimized multiply and add units and sustaining many of these units with appropriate data paths is practically the only game everybody plays while proposing new hardware for DNN training workloads (Sze et al., 2017).

One approach that has been quite successful in the past few years is to design highly optimized digital circuits using conventional CMOS technology that leverages reduced-precision arithmetic for the multiply and add operations. These techniques are already employed to some extent by current GPUs (Nvidia, 2021) and other application-specific integrated circuit (ASIC) designs, such as TPUs (Cloud TPU, 2007) and IPUs (Graphcore, 2021). There are also many research efforts extending the boundaries of reduced-precision training, using hybrid 8-bit (Sun et al., 2019) and 4-bit (Sun et al., 2020) floating-point and 5-bit logarithmically scaled (Miyashita et al., 2016) number formats.

As an alternative to digital CMOS, hardware architectures composed of novel resistive cross-point device arrays have been proposed that can deliver significant power and speed benefits for DNN training (Gokmen and Vlasov, 2016; Haensch et al., 2019; Burr et al., 2017; Burr et al., 2015; Yu, 2018). We refer to these cross-point devices as resistive processing unit [RPU (Gokmen and Vlasov, 2016)] devices, as they can perform all the multiply and add operations needed for training by relying on physics. Out of all multiply and add operations during training, 1/3 are performed during forward propagation, 1/3 are performed during error backpropagation, and finally, 1/3 are performed during gradient computation. RPU devices use Ohm's law and Kirchhoff's law (Steinbuch, 1961) to perform the multiply and add needed for the forward propagation and error backpropagation. More importantly, however, RPUs use the device conductance modulation and memory characteristics to perform the multiply and add needed during the gradient computation (Gokmen and Vlasov, 2016).

Unfortunately, RPU-based crossbar architectures have had only minimal success so far. That is mainly because the training accuracy on this imminent analog hardware strongly depends on the cross-point elements' conductance modulation characteristics when the conventional SGD algorithm is used. One of the key requirements is that these devices must symmetrically change conductance when subjected to positive or negative pulse stimuli (Gokmen and Vlasov, 2016; Agarwal et al., 2016). Theoretically, it is shown that only symmetric devices provide the unbiased gradient calculation and accumulation needed for the SGD algorithm, whereas any non-symmetric device characteristic modifies the optimization objective and hampers the convergence of SGD-based training (Gokmen and Haensch, 2020; Onen et al., 2021).

Many different solutions have been proposed to tackle SGD's convergence problem on crossbar arrays. First, widespread efforts have been made to engineer resistive devices with symmetric modulation characteristics (Fuller et al., 2019; Woo and Yu, 2018; Grollier et al., 2020), but a mature device technology with the desired behavior remains to be seen. Second, many high-level mitigation techniques have been proposed to overcome the device asymmetry problem. One critical issue with these techniques is the serial access to cross-point elements, either one-by-one or row-by-row (Ambrogio et al., 2018; Agarwal et al., 2017; Yu et al., 2015). Serial operations such as reading conductance values individually, engineering update pulses to force symmetric modulation artificially, and carrying or resetting weights periodically come with a tremendous overhead for large networks. Alternatively, there are approaches that perform the gradient computation outside the arrays using digital processing (Nandakumar et al., 2020). Note that, irrespective of the DNN architecture, 1/3 of the whole training workload is in the gradient computation. For instance, for the GPT-3 network, 1,200 Petaflops/s-days are required solely for gradient computation throughout the training. Consequently, these approaches cannot deliver much more performance than the fully digital reduced-precision alternatives mentioned above. In short, there exist solutions possibly addressing the convergence issue of SGD on non-symmetric device arrays. However, they all defeat the purpose of performing the multiply and add operations on the RPU device and lose the performance benefits.

In contrast to all previous approaches, we recently proposed a new training algorithm, the so-called Tiki-Taka algorithm (Gokmen and Haensch, 2020), that performs all three cycles (forward propagation, error backpropagation, and gradient computation) on the RPU arrays using the physics and converges with non-symmetric device arrays. Tiki-Taka works very differently from SGD, and we showed in another study that non-symmetric device behavior plays a useful role in the convergence of Tiki-Taka (Onen et al., 2021).

Here, we build on top of Tiki-Taka and present a more robust second version that relaxes other stringent hardware issues by orders of magnitude, namely the limited number of states of RPU devices and noise. We refer to this more robust second version of the Tiki-Taka algorithm as TTv2 for the rest of the paper. In the first part of the paper, we focus on training, present the TTv2 algorithm details, and provide simulation results at various hardware settings. We tested TTv2 on various network architectures, including fully connected, convolutional, and LSTM networks, although the presented results focus on the more challenging LSTM network. TTv2 shows significant improvements in training accuracy compared to Tiki-Taka, even at much more challenging hardware settings. In the second part of the paper, we show an analog-hardware-friendly technique to extract the trained model from the noisy hardware. We also generalize this technique, apply it over TTv2 iterates, and extract each weight's time average from a particular training period. These weight averages provide a model that approximates the Bayesian model average, and it outperforms the trained model itself. With this new training algorithm and accurate model extraction technique, we show that noisy analog hardware composed of RPU device arrays can provide scalable training solutions that match the performance of full-precision SGD.
PART I: Training

In this section, we first give an overview of the device arrays and device update characteristics used for training. Then we present a brief background on Tiki-Taka. Finally, we detail TTv2 and provide comprehensive simulation results on an LSTM network at various hardware settings.

Device Arrays and Conductance Modulation Characteristics

A resistive crossbar array of devices performs efficient matrix-vector multiplication (y = Wx) using Ohm's law and Kirchhoff's law. The device array's stored conductance values form a matrix (W), whereas the input vector (x) is transmitted as voltage pulses through the columns, and the resulting vector (y) is read as current signals from the rows. However, only positive conductance values are allowed physically. Therefore, to encode both positive and negative matrix elements, a pair of devices is operated in differential mode. With the help of the peripheral circuits supplying the voltage inputs and reading out the differential current signals, logical matrix elements (wij) are mapped to physical conductance pairs as

wij = Κ(gij − gij,ref)   (1)

where Κ is a global gain factor controlled by the periphery, and gij and gij,ref are the conductance values stored at each pair corresponding to the ith row and jth column. Moreover, crossbar arrays can easily be operated in the transpose mode by changing the periphery's input and output directions. As a result, a pair of arrays with the supporting peripheral circuits provides a logical matrix (also referred to as a single tile) that any algorithm can utilize to perform a series of matrix-vector multiplications (mat-vecs) using W and W^T.
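As a concrete illustration of this tile abstraction, the short sketch below models the differential-pair mapping of Eq. 1 and a noisy mat-vec in both normal and transpose modes. It is only an illustrative stand-in for the simulator used in this work; the class name, array size, and noise level are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

class AnalogTile:
    """Toy model of a single tile: w_ij = K * (g_ij - g_ij,ref), as in Eq. 1."""

    def __init__(self, n_rows, n_cols, gain_k=1.0, mv_noise=0.06):
        self.gain_k = gain_k        # global gain factor K set by the periphery
        self.mv_noise = mv_noise    # additive read-out noise of the analog mat-vec
        # only positive conductances are physical; both arrays start at a mid level
        self.g = np.full((n_rows, n_cols), 0.5)
        self.g_ref = np.full((n_rows, n_cols), 0.5)

    @property
    def weights(self):
        # signed logical weights realized by the differential pair (Eq. 1)
        return self.gain_k * (self.g - self.g_ref)

    def matvec(self, x, transpose=False):
        """Noisy y = W x, or y = W^T x when the periphery runs in transpose mode."""
        w = self.weights.T if transpose else self.weights
        y = w @ x
        return y + self.mv_noise * rng.standard_normal(y.shape)

# forward and backward (transpose) passes reuse the same stored conductances
tile = AnalogTile(4, 8)
y = tile.matvec(rng.uniform(-1, 1, size=8))                    # uses W, shape (4,)
err = tile.matvec(rng.uniform(-1, 1, size=4), transpose=True)  # uses W^T, shape (8,)
print(y.shape, err.shape)
```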
For training algorithms, the efficient update of the stored matrix elements is also an essential component. Therefore, device conductance modulation and memory characteristics are utilized to implement a local and parallel update on RPU arrays. During the update cycle, input signals are encoded as a series of voltage pulses and simultaneously supplied to the array's rows and columns. Note that the voltage pulses are applied only to the first set of RPU devices, and the reference devices are kept constant. As a result of voltage pulse coincidence, the corresponding RPU device changes its conductance by a small amount bi-directionally, depending on the voltage polarity. This incremental change in device conductance results in an incremental change in the stored weight value, and the RPU response is governed by Eq. 2.

wij ← wij ∓ Δwmin,ij Fij(wij) − Δwmin,ij Gij(wij)   (2)

In Eq. 2, the ∓ sign is decided by the external voltage pulse polarity, Δwmin,ij is the incremental weight change due to a single pulse coincidence, and Fij(wij) and Gij(wij) are the symmetric (additive) and antisymmetric (subtractive) combinations of the positive and negative conductance modulation characteristics (Gokmen and Haensch, 2020), all of which are properties of the updated device corresponding to the ith row and jth column. Eq. 2 is very general and governs the computation (hardware-induced update) performed by the tile for all sorts of RPU device behaviors, assuming the device conductance modulation characteristics are some function of the device conductance state. If the conductance modulations are much smaller than the whole conductance range of operation, Eq. 3 can be derived from Eq. 2.

wij ← wij + η[δi × xj] Fij(wij) − η|δi × xj| Gij(wij)   (3)

In Eq. 3, xj and δi represent the input values used for updates for each column and row, respectively, corresponding to the activations and errors calculated in the forward and backward cycles, and η is a scalar controlling the strength of the update, all of which are inputs to the pulse generation circuitry at the periphery. Here, we use the stochastic pulsing scheme proposed in Ref Gokmen and Vlasov (2016), and during the parallel update, the number of pulses generated by the periphery is bounded by npulse = η max(|δi|) max(|xj|)/μΔw, where μΔw is the mean of Δwmin,ij for the whole tile. Using npulse, stochastic translators generate pulses with the correct probability; therefore, Eq. 3 is valid in expectation, whereas in the limit of a single pulse coincidence, the RPU response is governed by Eq. 2.

Figure 1A illustrates the pulse response of a linear and symmetric device, where F(w) = 1 and G(w) = 0, and the hardware-induced update rule simplifies to the SGD update rule of wij ← wij + η[δi × xj]. In the literature, this kind of device behavior is usually referred to as the "ideal" device required for SGD. For a non-linear but symmetric device, F(w) deviates from unity and becomes a function of w, but G(w) remains zero. For non-symmetric devices, G(w) also deviates from zero and becomes a function of w, hence differing from the form required by SGD. Figure 1B illustrates an exponentially saturating non-symmetric device, where wij ← wij + η[δi × xj] − η|δi × xj| wij describes the computation performed by this device. Although this form of update behavior causes convergence issues for SGD, Tiki-Taka trains DNNs successfully with all sorts of non-symmetric devices (Gokmen and Haensch, 2020). Therefore, in contrast to SGD, all sorts of non-symmetric device behaviors can be considered "ideal" for Tiki-Taka.

Tiki-Taka's training performance depends on the successful application of the symmetry point shifting technique (Kim et al., 2019), which guarantees G(w = 0) = 0 for all elements in the tile. This behavior is illustrated for the device in Figure 1B, where the strengths of the positive and negative weight increments are equal at the symmetry point (w = 0).
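To make the pulsed update concrete, the sketch below applies Eq. 2 coincidence by coincidence for a single device and recovers Eq. 3 in expectation. It is a hedged illustration rather than the paper's simulator: the pulse probabilities, learning rate, and the choice F(w) = 1, G(w) = w (the saturating device of Figure 1B) are assumptions used only to demonstrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)

def pulsed_update(w, x, delta, dw_min=0.001, eta=0.1,
                  F=lambda w: 1.0, G=lambda w: w):
    """One parallel update of a single device following Eq. 2/3.

    Each stochastic pulse coincides with a probability chosen so that the
    expected total change equals eta*(x*delta)*F(w) - eta*|x*delta|*G(w) (Eq. 3),
    while every individual coincidence applies Eq. 2.
    """
    n_pulse = max(int(np.ceil(eta * abs(x * delta) / dw_min)), 1)
    sign = np.sign(x * delta)                  # pulse polarity set by the periphery
    p = min(eta * abs(x * delta) / (n_pulse * dw_min), 1.0)
    for _ in range(n_pulse):
        if rng.random() < p:                   # stochastic translator coincidence
            w = w + sign * dw_min * F(w) - dw_min * G(w)   # Eq. 2
    return w

# F(w)=1, G(w)=w models the exponentially saturating device of Figure 1B;
# with zero-mean inputs the weight decays toward the symmetry point w = 0.
w = 0.4
for _ in range(500):
    w = pulsed_update(w, x=rng.choice([-0.5, 0.5]), delta=0.3)
print(round(w, 3))   # fluctuates around 0
```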
Using the generated u vector and the result of f(v), C is updated by employing the hardware-induced parallel update (line 15). f(v) is a pointwise function:

f(vi) = vi if |vi| ≥ T, and 0 otherwise,

where T is set to the mat-vec noise. ηc is the learning rate used for updating C. These operations are repeated for the data examples in the training dataset for multiple epochs until a convergence criterion is met. Following the same practices described in Ref Gokmen and Haensch (2020), here we also use the one-hot encoded u vectors and the thresholding f(v) for the LSTM simulations.
TTv2 Algorithm

Algorithm 2 outlines the details of the TTv2 algorithm. In addition to the A and C matrices allocated on analog arrays, TTv2 also allocates another matrix H in the digital domain. This matrix H is used to implement a low-pass filter while transferring the gradient information processed by A to C. In contrast to Tiki-Taka, TTv2 uses only the C matrix to encode the neural network's parameters, corresponding to c = 0. Therefore, the activation (x) and error (δ) computations are performed using C (lines 10 and 11). TTv2 does not change the updates performed on A. After ns updates, a mat-vec is performed on A. Unlike Tiki-Taka, TTv2 only uses a one-hot encoded u vector while performing a mat-vec on A. This provides a noisy estimate of a single row of A, and the results are stored in v. After this step, the significant distinction between Tiki-Taka and TTv2 appears. Instead of using u and v to update C, TTv2 first accumulates v (after scaling with ηc) on H's corresponding row, referred to as H(row t). During this digital vector-vector addition, the magnitude of any element in H(row t) may exceed unity. In that case, the corresponding elements are reset back to zero, and a single-pulse parallel update on C is performed. The C update of TTv2 uses the sign information of the elements that grew in amplitude beyond one and the row information t. After these steps, TTv2 loops back and repeats these operations for other data examples until it reaches convergence.
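The data flow just described can be paraphrased in a few lines of plain Python, as sketched below. This is not the Algorithm 2 listing itself but a hedged illustration of the A → H → C transfer: the array sizes, learning rates, threshold of one, and the synthetic "error" signal are all assumptions, and the analog arrays are idealized as floating-point matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_out, n_in = 8, 16
n_s, eta, eta_c, dw_min = 5, 0.1, 0.05, 0.01   # placeholder hyper-parameters

A = np.zeros((n_out, n_in))                # gradient-processing matrix (analog, idealized)
C = rng.uniform(-0.1, 0.1, (n_out, n_in))  # network weights (analog)
H = np.zeros((n_out, n_in))                # digital low-pass filter matrix

def noisy_matvec(M, v, sigma=0.06):
    return M @ v + sigma * rng.standard_normal(M.shape[0])

row = 0                                    # row counter t, advanced once per transfer
for step in range(1, 201):
    x = rng.uniform(-1, 1, n_in)           # activations (forward pass would use C)
    delta = -noisy_matvec(C, x)            # stand-in for the backpropagated error
    A += eta * np.outer(delta, x)          # parallel (rank-one) update of A

    if step % n_s == 0:                    # after n_s updates: transfer through H
        u = np.zeros(n_out)
        u[row] = 1.0                       # one-hot read vector
        v = noisy_matvec(A.T, u)           # noisy estimate of row `row` of A
        H[row] += eta_c * v                # digital vector-vector accumulation
        fired = np.abs(H[row]) >= 1.0      # elements with enough accumulated evidence
        C[row, fired] += dw_min * np.sign(H[row, fired])   # single-pulse update on C
        H[row, fired] = 0.0                # reset only the transferred elements
        row = (row + 1) % n_out

print(float(np.abs(C).mean()))
```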
Training Simulations

We performed training simulations for fully connected, convolutional, and LSTM networks: the same three networks and datasets studied in Ref Gokmen and Haensch (2020).

Array Model

We use a device model like the one presented in Figure 1B but with significant array-level variability and noise for the training simulations. We simulate stochastic translators at the periphery during the update, and each coincidence event triggers an incremental weight change on the corresponding RPU as described below. We also introduce noise and signal bounds during the matrix-vector multiplications performed on the arrays.

During the update, the weight increments (Δw+ij) and decrements (Δw−ij) are assumed to be functions of the current weight value. For the positive branch Δw+ij = Δwmin,ij (1 − slope+ij × wij), and for the negative branch Δw−ij = Δwmin,ij (1 + slope−ij × wij), where slope+ij and slope−ij are the slopes that control the dependence of the weight changes on the current weight values, and Δwmin,ij is the weight change due to a single coincidence event at the symmetry point. This model results in three unique parameters for each RPU element. All these parameters are sampled independently using a unit Gaussian random variable and then used throughout the training, providing device-to-device variability. The slopes are obtained using slope+ij = μs(1 + σs ξ+ij) and slope−ij = μs(1 + σs ξ−ij), where μs = 1.66, σs is set to 0.1, 0.2, or 0.3 for different experiments, and ξ are the independent random samples. The simulation results were insensitive to σs; therefore, we only show the results corresponding to σs = 0.2. The weight increments at the symmetry point are obtained using Δwmin,ij = μΔw(1 + σΔw ξij), where σΔw = 0.3 and μΔw is the array average, varied from 0.6 × 10−4 up to 0.15 for different experiments to study the effect of the number of states on training accuracy. We define the number of states as the ratio of the nominal weight range to the nominal weight increment at the symmetry point; therefore, 2/(μs μΔw) provides the average number of states. Note that this definition of the number of states is very different from the definition used for devices developed for memory applications, and it should not be compared against multi-bit storage elements. Besides, additional Gaussian noise is introduced to each weight increment and decrement to capture the cycle-to-cycle noise: for the multiplicative noise model Δw∓ij → Δw∓ij (1 + σcycle ξ), whereas for the additive noise model Δw∓ij → Δw∓ij + Δwmin,ij σcycle ξ, where σcycle is set to 0.3 or 1 for different experiments, and ξ is sampled from a unit Gaussian for each coincidence event.
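For reference, this array model can be written down directly as sampled per-device parameters; the sketch below does so using the statistics quoted above (μs = 1.66, σs = 0.2, σΔw = 0.3, additive cycle-to-cycle noise with σcycle = 1). The tile size and the way the coincidences are driven are illustrative assumptions, not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(3)
shape = (64, 64)                      # tile size (illustrative)

mu_s, sigma_s = 1.66, 0.2             # slope statistics quoted above
mu_dw, sigma_dw = 0.08, 0.3           # array-average weight increment (~15 states)
sigma_cycle = 1.0                     # cycle-to-cycle noise, additive model

# three unique parameters per RPU device, sampled once and fixed for training
slope_plus = mu_s * (1 + sigma_s * rng.standard_normal(shape))
slope_minus = mu_s * (1 + sigma_s * rng.standard_normal(shape))
dw_min = mu_dw * (1 + sigma_dw * rng.standard_normal(shape))

print("average number of states ~", round(2 / (mu_s * mu_dw)))   # 2/(mu_s*mu_dw) ~ 15

def single_coincidence(w, up):
    """Per-device weight change for one pulse coincidence, with additive cycle noise."""
    if up:
        dw = dw_min * (1 - slope_plus * w)       # increment branch
    else:
        dw = -dw_min * (1 + slope_minus * w)     # decrement branch
    dw += dw_min * sigma_cycle * rng.standard_normal(shape)
    return w + dw

w = np.zeros(shape)
w = single_coincidence(w, up=True)
w = single_coincidence(w, up=False)
print(float(np.abs(w).mean()))
```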
FIGURE 3 | LSTM training simulations for SGD, Tiki-Taka, and TTv2 algorithms. Different color curves use an array model with non-symmetric devices, μΔw = 0.001 (corresponding to 1,200 states), and the multiplicative cycle-to-cycle update noise at σcycle = 0.3. The square symbols show the SGD training using linear and symmetric devices where all devices' slope parameters are set to zero while all other array parameters remain unchanged. The open circles are the floating-point baseline.

However, the presented results focus on the most challenging LSTM network, referred to as LSTM2-64-WP in Ref Gokmen et al. (2018). This network is composed of two stacked LSTM blocks, each with a hidden state size of 64. Leo Tolstoy's War and Peace (WP) novel is used as a dataset, and it is split into training and test sets of 2,933,246 and 325,000 characters with a total vocabulary of 87 characters. This task trains a character-based language model where the input to the network is a sequence of characters from the WP novel, and the network is trained with the cross-entropy loss function to predict the next character in the sequence. LSTM2-64-WP has three different weight matrices for SGD, and including the

FIGURE 5 | (A–C) Blue curves: the evolution of three different weights (corresponding to three different devices with non-symmetric behavior, σcycle = 1 and about 15 states) during TTv2 training. Red curves show the sign of the updates and the expected average saturation value for the corresponding device. (D) The evolution of a linear and symmetric device with σcycle = 0.3 and more than 1,000 states. (E) LSTM training simulations for SGD, Tiki-Taka, and TTv2 algorithms. Different color curves use an extremely noisy array model with non-symmetric devices, μΔw = 0.08 (corresponding to 15 states), and the additive cycle-to-cycle update noise with σcycle = 1. The square symbols show the SGD training baseline from Figure 3 with symmetric device arrays with 1,200 states and cycle-to-cycle noise at σcycle = 0.3. The open circles are the floating-point baseline.
weight opposite to the intended direction, as illustrated in Figures 5A–C. Furthermore, same-sign updates are encouraged to accelerate the learning along the dimensions that have higher confidence.

Finally, we emphasize that, in contrast to SGD and Tiki-Taka, TTv2 only fails gracefully at these extremely challenging hardware settings. We note that continued training further improves the performance of TTv2 until 200 epochs, and a test error of 1.57 is achieved for the modified TTv2. This test error is almost identical to the one achieved by the symmetric-device SGD baseline with 1,200 states and many orders of magnitude less noise. All these results show that TTv2 is superior to Tiki-Taka and SGD, especially when the analog hardware becomes noisy and provides a very limited number of states on RPU devices.

Implementation Cost of TTv2

The true benefit of using device arrays for training workloads emerges when the required gradient computation (and processing) step is performed in the array using the RPU device properties. As mentioned in the introduction, the gradient computation is 1/3 of the training operations performed on the weights that the hardware must handle efficiently. Irrespective of the layer type, such as convolution, fully connected, or LSTM, for an n × n weight matrix in a neural network, each gradient processing step per weight reuse has a computational complexity of O(n²). RPU arrays perform the required gradient processing step efficiently in O(1) constant time using array parallelism. Specifically, analog arrays deliver O(1) time complexity simply because the array has O(n²) compute resources (RPU devices). In this scheme, each computation is mapped to a resource, and consequently, RPU arrays trade space complexity for time complexity, whereas computational complexity remains unchanged. As a result of this spatial mapping, crossbar-based analog accelerators require a multi-tile architecture design irrespective of the training algorithm, so that each neural network layer and the corresponding weights can be allocated on separate tiles. Nevertheless, RPU arrays provide a scalable solution for a spatially mapped, weight-stationary architecture for training workloads thanks to the nano-scale device concepts.

As highlighted in Algorithm 2, TTv2 uses the same tile operations, and therefore running TTv2 on array architectures requires no change in the tile design compared to SGD or Tiki-Taka. Assuming the tile design remains unchanged (a pair of device arrays operated differentially with the supporting peripheral circuits), TTv2 (like Tiki-Taka) requires twice as many tiles to allocate A and C separately. Alternatively, however, the logical A and C values can be realized using only three devices by sharing a common reference, as described in Ref Onen et al. (2021). In that case, the logical A and C matrices can be absorbed into a single tile design composed of three device arrays and operated in a time-multiplexed fashion. This tile design minimizes or even possibly eliminates the area cost of TTv2 and Tiki-Taka compared to SGD.

In contrast to the A and C matrices allocated on analog arrays, H does not require any spatial mapping as it is allocated digitally, and it can reside on an off-chip memory. Furthermore, we emphasize that the digital H processing of TTv2 must not be confused with the gradient computation step. For an n × n weight matrix in a neural network, the computational complexity of the operations performed on H is only O(n), even for the most aggressive setting of ns = 1. As detailed in Algorithm 2, only a single row of H is accessed and processed digitally for ns parallel array update operations on A. Therefore, H processing has reduced computational complexity compared to gradient computation: O(n) vs. O(n²). This property differentiates TTv2 from other approaches performing the gradient computation in the digital domain with O(n²) complexity (Nandakumar et al., 2020). Regardless, the digital H processing in TTv2 brings additional digital computation and memory bandwidth requirements compared to SGD or Tiki-Taka. To understand the extra burden introduced by H in TTv2, we must compare it to the burden already handled by the digital components for the SGD algorithm. We argue that the extra burden introduced in TTv2 is usually only on the order of 1/ns, and the digital components required by the SGD algorithm can also handle the H processing of TTv2.

A weight reuse factor (ws) for each layer in a neural network is determined by various factors, such as time-unrolling steps in an LSTM, reuse of filters for different image portions in a convolution, or simply the use of mini-batches during training. For an n × n weight matrix with a weight reuse factor of ws, the compute performed on the analog array is O(n²·ws). In contrast, the storage and processing performed digitally for the activations and error backpropagation are usually O(n·ws). We emphasize that these O(n·ws) compute and storage requirements are common to TTv2, Tiki-Taka, and SGD and are already addressed by digital components.

The digital filter of TTv2 computes straightforward vector-vector additions and thresholds, which require O(n) operations performed only after ns weight reuses. As mentioned above, SGD (likewise Tiki-Taka and TTv2) uses digital units to compute the activations and the error signals, both of which are usually O(n·ws). Therefore, the digital compute needed for the H processing of TTv2 increases the total digital compute by O(n·ws/ns).

Additionally, the filter requires the H matrix to be stored digitally. H is as large as the neural network model and requires off-chip memory storage and access. One may argue that this defeats the purpose of using analog crossbar arrays. However, note that even though the storage requirement for H is O(n²), the access to H happens one row at a time, which is O(n). Therefore, as long as the memory bandwidth can sustain access to H, the storage requirement is a secondary concern that can easily be addressed by allocating space on external off-chip memory. This increases the required storage capacity from O(n·ws) (only for activations) to O(n·ws) + O(n²) (activations + H).

Finally, assuming H resides on an off-chip memory, the hardware architecture must provide enough memory bandwidth to access H. As noted in Algorithm 2, access to H is very regular, and only a single row of H is needed after ns weight reuses. For SGD (and hence for Tiki-Taka and TTv2), the activations computed in the forward pass are first stored in off-chip memory and then fetched from it to compute the error signals during the backpropagation. The activation stores and loads are also usually O(n·ws), and therefore the additional access to H in TTv2 similarly increases the required memory bandwidth by about O(n·ws/ns).

In summary, compared to SGD, TTv2 introduces extra digital costs that are only on the order of 1/ns, whereas it brings orders of magnitude relaxation to many stringent analog hardware specs. For instance, ns = 5 provided the best training results for the LSTM network, and for that network, the additional burden introduced to digital compute and memory bandwidth remains less than 20%. For the first convolutional layer of the MNIST problem, ns = 576 is used, making the additional cost negligible (Gokmen and Haensch, 2020). However, we note that neural networks come in many different flavors, beyond those studied in this manuscript, with different stress points on various hardware architectures. Our complexity arguments should only be used to compare the relative overhead of TTv2 to SGD, assuming a fixed analog crossbar-based architecture and particular neural network layers. A detailed power/performance analysis of TTv2 with an optimized architecture for a broad class of neural network models requires additional studies.

PART II: Model Extraction

Machine learning experts try various neural network architectures and hyper-parameter settings to obtain the best performing model during model development. Therefore, accelerating the DNN training process is extremely important. However, once the desired model is obtained, it is equally important to deploy the model in the field successfully. Even though training may use one set of hardware, numerous users likely run the deployed model on several hardware architectures, separate from the one the machine learning experts trained the model with. Therefore, to close the development and deployment lifecycle, the desired model must be extracted from the analog hardware for its deployment on another hardware.

In contrast to digital solutions, the weights of the model are not directly accessible on analog hardware. Analog arrays encode the model's weights, and the tile's noisy mat-vec limits access to these weight matrices. Therefore, the extraction of the model from analog hardware is a non-trivial task. Furthermore, the model extraction must produce a good representation of the trained model to be deployed without loss of accuracy on another analog or a completely different digital hardware for inference workloads.

In Part II, we first show how an accurate weight extraction can be performed from noisy analog hardware. Then we further generalize this method to obtain an accurate model average over the TTv2 iterates. Ref Izmailov et al. (2019a) showed that the Stochastic Weight Averaging (SWA) procedure, which performs a simple averaging of multiple points along the trajectory of SGD, leads to better generalization than conventional training. Our analog-hardware-friendly SWA on TTv2 iterates shows that these techniques inspired by the Bayesian treatment of neural networks can also be applied to analog training hardware successfully. We show that the model averaging further boosts the extracted model's generalization performance and provides a model that is even better than the trained model itself, enabling the deployment of the extracted model virtually on any other hardware.
Accurate Weight Extraction

Analog tiles perform mat-vecs on the stored matrices. Therefore, naively, one can perform a series of mat-vecs using one-hot encoded inputs to extract the stored values one column (or one row) at a time. However, this scheme results in a very crude estimation of the weights due to the mat-vec noise and limited ADC resolution. Instead, we perform a series of mat-vecs using random inputs and then use the conventional linear regression formula, Eq. 4, to estimate the weights.

Ĉ = [(XX^T)^−1 XY^T]^T   (4)

In Eq. 4, Ĉ is an estimate of the ground truth matrix C stored on the tile, X has the inputs used during weight extraction, and Y has the resulting outputs read from the tile. Both X and Y are written in matrix form, capturing all the mat-vecs performed on the tile.

Figure 6 shows the quality of different weight estimations for a simulated tile of size 512 × 512 with the same analog array assumptions described in Part I. When one-hot encoded input vectors are used only once, corresponding to 512 mat-vecs, the correlation of the extracted values to the ground truth is very poor due to analog mat-vec noise (σMV) and ADC quantization, as seen in Figure 6A. Repeating the same measurements 20 times, corresponding to a total of 10,240 mat-vecs, improves the quality of the estimate (Figure 6B). However, the best estimate is obtained when completely random inputs with uniform distribution are used, as illustrated in Figure 6C. We note that the total number of mat-vecs is the same for Figures 6B,C, and yet Figure 6C provides a much better estimate. This is because the completely random inputs have the highest entropy (information content), and therefore they provide the best estimate of the ground truth for the same number of mat-vecs.

Note that, in this linear regression formalism, the tile noise and quantization error correspond to aleatoric uncertainty and cannot be improved. However, the weight estimates are not limited by the aleatoric uncertainty; instead, the epistemic uncertainty limits these estimates. For the data shown in Figure 6C, the standard deviation in weight estimation (corresponding to the epistemic uncertainty) is 0.002, only 0.1% of the nominal weight range of [−1, 1] used for these experiments. The uncertainty in the weight estimates scales with 1/√(number of mat-vecs), and if needed, this uncertainty can be further reduced by performing more measurements.
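The extraction procedure can be reproduced with a few lines of linear algebra, as in the sketch below. The tile, its noise, and the ADC model are simplified stand-ins for the array assumptions of Part I (the noise and quantization values here are arbitrary), but the comparison mirrors Figure 6: for the same mat-vec budget, random inputs combined with the least-squares formula of Eq. 4 recover the stored weights far better than repeated one-hot reads.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_matvec = 512, 10_240
sigma_mv, adc_step = 1.0, 0.5           # assumed read noise and ADC quantization step

C_true = rng.uniform(-1, 1, (n, n))

def tile_matvec(X):
    """Noisy, quantized Y = C X for a batch of input columns X."""
    Y = C_true @ X + sigma_mv * rng.standard_normal((n, X.shape[1]))
    return adc_step * np.round(Y / adc_step)    # crude ADC model

# (a) one-hot inputs repeated 20 times: average the repeated read-outs per column
X_onehot = np.tile(np.eye(n), 20)
C_onehot = tile_matvec(X_onehot).reshape(n, 20, n).mean(axis=1)

# (b) uniform-random inputs with the same mat-vec budget, estimated via Eq. 4
X = rng.uniform(-1, 1, (n, n_matvec))
Y = tile_matvec(X)
C_hat = (np.linalg.inv(X @ X.T) @ X @ Y.T).T

for name, est in [("one-hot x20", C_onehot), ("random + Eq. 4", C_hat)]:
    print(name, "error std:", float((est - C_true).std()))
```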
FIGURE 6 | (A–C) Correlation between the ground truth weights and the extracted values using different input forms and number of mat-vecs for a simulated 512 × 512 tile. Red lines are guides to the eye showing perfect correlation.
Accurate Model Average

As shown in Ref Izmailov et al. (2019a), SWA performs a simple averaging of multiple points along the trajectory of SGD and leads to better generalization than conventional training. This SWA procedure approximates the Fast Geometric Ensemble (FGE) approach with a single model. Furthermore, Ref Yang et al. (2019) showed that SWA brings benefits to low-precision training. Here, we propose that weight averaging over TTv2 iterates would also bring similar gains and possibly overcome the noisy updates unique to the RPU devices. However, obtaining the weight averages from analog hardware may become prohibitively expensive. Naïvely, the weights can be first extracted from analog hardware after each iteration and then accumulated in the digital domain to compute averages. However, this requires thousands of mat-vecs per iteration and therefore is not feasible.

Instead, to estimate the weight averages, we perform a series of mat-vecs that are very sparse in time but performed while the training progresses, and then use the same linear regression formula to extract the weights. Since the mat-vecs are performed while the weights are still evolving, the extracted values closely approximate the weight averages for that training period. For instance, during the last 10 epochs of the TTv2 iterates, we performed 100 K mat-vecs with uniform-random inputs and showed that this is sufficient to estimate the actual weight averages with less than 0.1% uncertainty.

We note that about 60 M mat-vecs on C and 30 M updates on A are performed during 10 epochs of training. Therefore, the additional 100 K mat-vecs on C needed for weight averaging increase the compute on the analog tiles by only 0.1%. Furthermore, the input and output vectors (x, y) for each mat-vec can be processed on the fly by accumulating the results of xx^T and xy^T on two separate matrices in the digital domain: Mxx ← Mxx + xx^T and Mxy ← Mxy + xy^T. Then, at the end of the training, one matrix inversion and a final matrix-matrix multiplication need to be performed to complete all the steps needed to estimate the weight averages: Cavg = ((Mxx)^−1 Mxy)^T.

In practical applications, a separate conventional digital processor (like a CPU) can perform the computations needed for the weight averages by only receiving the results of the mat-vecs from the analog accelerator. Note that the CPU can generate the same input vectors by using the same random seed. Therefore, Mxx and its inverse can be computed and stored well ahead of time, even before training starts. Furthermore, the same input vectors and a common (Mxx)^−1 can extract the weight averages from multiple analog tiles. After all these optimizations, even a conventional digital processor can sustain the computation needed for Mxy from multiple tiles and provide the weight averages at the end of training.
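The on-the-fly bookkeeping described above boils down to two running sums and a single solve at the end, as the sketch below illustrates on a synthetic, slowly drifting weight matrix (tile size, probe count, drift, and noise are all assumptions made for the example).

```python
import numpy as np

n, n_probe = 256, 5000
rng_inputs = np.random.default_rng(123)   # the host CPU can replay this seed
rng_sim = np.random.default_rng(7)

C_tile = np.zeros((n, n))                 # stand-in for the analog weights
C_mean = np.zeros((n, n))                 # running time average (for checking only)
M_xx = np.zeros((n, n))
M_xy = np.zeros((n, n))

for k in range(n_probe):
    C_tile += 1e-4 * rng_sim.standard_normal((n, n))     # weights keep evolving
    C_mean += (C_tile - C_mean) / (k + 1)
    x = rng_inputs.uniform(-1, 1, n)                      # sparse-in-time probe input
    y = C_tile @ x + 0.05 * rng_sim.standard_normal(n)    # noisy mat-vec from the tile
    M_xx += np.outer(x, x)                                # M_xx <- M_xx + x x^T
    M_xy += np.outer(x, y)                                # M_xy <- M_xy + x y^T

# one matrix inversion and one matrix-matrix multiply at the end of training
C_avg = (np.linalg.inv(M_xx) @ M_xy).T
print("mean |C_avg - time average|:", float(np.abs(C_avg - C_mean).mean()))
```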
Inference Results

To test the validity of the proposed weight extraction and averaging techniques, we study the same model trained on extremely noisy analog hardware using TTv2 with the hysteretic threshold. We refer to this model as Model-I. As shown in Figure 5E, the test errors of Model-I at the end of the 50th and 200th epochs are 1.633 and 1.570, respectively. These test errors assume Model-I runs inferences on the same analog hardware it was trained on, and they form our baseline.

In the first experiment, we apply our model extraction technique and obtain the weights using only 10 K mat-vecs with random inputs. We refer to this extracted model as Model-Ix, and it is an estimate of Model-I. We evaluate the test error of Model-Ix when it runs either on another analog hardware (with the same analog array properties) or on digital hardware. As summarized in Table 1, Model-Ix's test error remains unchanged on the new analog hardware compared to Model-I, showing our model extraction technique's success. Interestingly, the inference results of Model-Ix are better on the digital hardware, and the test errors drop to 1.583 and 1.524, respectively, for the 50th and 200th epochs. These improvements are due to the absence of the mat-vec noise introduced by the forward propagation on analog hardware. However, these results also highlight that the analog training yields a better model than the test error on the same analog hardware indicates. Therefore, such benefits ease analog hardware's adoption for training-only purposes, and the improved test results on digital hardware are the relevant metrics for such a use case.

In the following experiment, we implement our model averaging technique using 100 K mat-vecs with random inputs applied between epochs 40–50 or 180–200. We refer to the extracted model average as Model-Iavg, and the test error for Model-Iavg is also evaluated on analog or digital hardware. In all cases, as illustrated in Table 1, Model-Iavg gives non-trivial improvements compared to Model-Ix (and Model-I). These improvements in the averaged models' generalization performance show the success of our model averaging technique. We emphasize that the model training is performed on extremely noisy analog hardware using TTv2. Nevertheless, the test error achieved by Model-Iavg on digital hardware is 1.454, just shy of the FP model's performance at about 1.325.

Finally, to further illustrate the success of the proposed model extraction and averaging techniques, we performed simulations for another two models, Model-II and Model-III, which are also
summarized in Table 1. Like Model-I, these models are also trained on noisy analog hardware but with slightly relaxed array assumptions. The only two differences compared to Model-I are that 1) Model-II and Model-III both used analog arrays with the additive cycle-to-cycle update noise at σcycle = 0.3, and 2) Model-II and Model-III respectively had 60 and 120 states on the RPU devices. For these slightly relaxed but still significantly noisy analog hardware settings, both Model-II and Model-III provide test results on the digital hardware that are virtually indistinguishable from the FP model when the model averages between epochs 180–200 are used.

We note that the inference simulations performed on analog hardware did not include any weight programming errors that may otherwise exist in real hardware. Depending on their strength, these weight programming errors cause an accuracy drop on analog hardware used solely for inference purposes. Additionally, after the initial programming, the accuracy may further decline over time due to device instability, such as conductance drift (Mackin et al., 2020; Joshi et al., 2020). Therefore, any analog hardware targeting inference workloads must address these non-idealities. However, we emphasize that these problems are unique to inference workloads. Instead, if analog hardware targets training workloads only, these problems become obsolete.

Furthermore, the unique challenges of the analog training hardware, namely the limited number of states on RPU devices and the update noise, are successfully handled by our proposed TTv2 training algorithm and the model averaging technique. As illustrated above, even very noisy analog hardware can deliver models on par in accuracy with FP models. In addition, after the training process is performed on analog hardware using TTv2, the extracted model average can be deployed on various digital hardware and perform inference without any accuracy loss. Therefore, these results provide a clear path for analog hardware to be employed to accelerate DNN training workloads.

DISCUSSION AND FUTURE DIRECTIONS

DNN training using SGD is simply an optimization algorithm that provides a point estimate of the DNN parameters at the end of the training. In this frequentist view, a hypothesis is tested without assigning any probability distribution to the DNN parameters, and the representation of uncertainty is lacking. More recently, however, the Bayesian treatment of DNNs has gained more traction with new approximate Bayesian approaches (Wilson, 2020). Bayesian approaches treat the DNN parameters as random variables with probabilities. We believe many exciting directions for future research may connect these approximate Bayesian approaches and neural networks running on noisy analog hardware.

For instance, Ref Maddox et al. (2019) showed that a simple baseline for Bayesian uncertainty could be formed by determining the weight uncertainties from the SGD iterates, referred to as SWA-Gaussian. It is empirically shown that SWA-Gaussian approximates the shape of the true posterior distribution of the weights, described by the stationary distribution of SGD iterates. We can intuitively generalize these results to the TTv2 algorithm running on analog hardware. For instance, the proposed TTv2 algorithm updates a tiny fraction of the neural network weights when enough evidence is accumulated by A and H's gradient processing steps. Nevertheless, the updates on the weights are still noisy due to stochasticity in the analog hardware. Therefore, TTv2 iterates resemble the Gibbs sampling algorithm used to approximate a posterior multivariate probability distribution governed by the loss surface of the DNN. Assuming this intuition is correct, analyzing the uncertainty in the weights over TTv2 iterates may provide a simple Bayesian treatment of a DNN, similar to SWA-Gaussian.

To test the feasibility of the above arguments, we performed the following experiments, which are motivated by the results of SWA-Gaussian (Maddox et al., 2019) and Bayes-by-Backprop (Blundell et al., 2015). First, we extract the mean (μi) and the standard deviation (σi) of each weight from the TTv2 iterates and define a signal-to-noise ratio as μi/σi. Then we remove the weights whose signal-to-noise ratio is below a certain value and compare the inference performance of this carefully pruned network to the unpruned one. We also look at the performance degradation of a randomly pruned network with the same amount of weight pruning. Table 2 summarizes the results of these
experiments performed for Model-III from 180 to 200 epochs.

TABLE 2 | Carefully pruned vs. randomly pruned network performance for different signal-to-noise thresholds used for pruning and weight proportions removed (%). Random pruning experiments are performed 10 times; the table reports the mean and standard deviation of these 10 experiments for the randomly pruned networks. An untrained network gives ∼4.46 test error, corresponding to a random guess.

As illustrated in Table 2, the carefully pruned network's performance (1.331) is almost identical to the unpruned one (1.326) when μi/σi < 1, corresponding to 16.7% pruning. However, the same amount of pruning causes significant performance degradation for a randomly pruned network (∼3.42). When the signal-to-noise threshold is raised to 3, corresponding to 40.8% pruning, the carefully pruned network still performs reasonably well (1.466), whereas at this level of pruning a randomly pruned network is not any better than an untrained network producing random predictions.

In the second set of experiments, as summarized in Table 3, we use the extracted means (μi) and standard deviations (σi) and disturb each weight randomly, proportionally to its standard deviation: wi = μi + ξσi, where ξ is sampled from a unit Gaussian for each weight. Then, we compare the inference performance of this carefully disturbed network to a randomly disturbed network with the same amount of total weight disturbance. Although the carefully disturbed network performs reasonably well at 1.493, the randomly disturbed networks' performance significantly degrades to about 3.54.

TABLE 3 | Carefully disturbed vs. randomly disturbed network performance. Random disturb experiments are performed 10 times; the table reports the mean and standard deviation of these 10 experiments.
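Both experiments reduce to simple element-wise rules once μi and σi are available; the sketch below spells them out on synthetic statistics. The distributions, the threshold, and the way the "same total disturbance" is matched (by shuffling the σi) are assumptions made for illustration; reproducing the test errors of Tables 2, 3 would require plugging these weights back into the actual LSTM.

```python
import numpy as np

rng = np.random.default_rng(5)
n_weights = 10_000

# stand-ins for the per-weight means and standard deviations from TTv2 iterates
mu = rng.normal(0.0, 0.3, n_weights)
sigma = np.abs(rng.normal(0.05, 0.02, n_weights)) + 1e-3

snr = np.abs(mu) / sigma                       # per-weight signal-to-noise ratio

# careful pruning: remove weights whose signal-to-noise ratio falls below a threshold
threshold = 1.0
w_careful = np.where(snr < threshold, 0.0, mu)
k = int((snr < threshold).sum())
print("pruned fraction:", k / n_weights)

# random pruning of the same number of weights, for comparison
w_random = mu.copy()
w_random[rng.choice(n_weights, size=k, replace=False)] = 0.0

# careful disturbance: w_i = mu_i + xi * sigma_i, one unit-Gaussian sample per weight
w_disturbed = mu + sigma * rng.standard_normal(n_weights)
# random disturbance with the same total amount of perturbation (sigmas shuffled)
w_rand_dist = mu + rng.permutation(sigma) * rng.standard_normal(n_weights)
```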
These experiments empirically suggest that the weight uncertainty of TTv2 iterates on analog hardware provides additional valuable information about the posterior probability distribution of the weights governed by the loss surface of the DNN. The results illustrated in Tables 2, 3 do not address how the weight uncertainty can be extracted from analog hardware in practical settings; however, suppose this information can be extracted. In that case, the weight uncertainty can be used to sparsify the DNN during the model deployment on digital hardware (Blundell et al., 2015). Alternatively, the weight uncertainties can be leveraged to devise better programming routines while transferring the model to another noisy analog hardware. In addition, a low-dimensional subspace can be constructed over TTv2 iterates so that the model can be deployed as a Bayesian neural network, similar to the results presented in Ref Izmailov et al. (2019b). The Bayesian model averaging performed even in low-dimensional subspaces produces accurate predictions and well-calibrated predictive uncertainty (Izmailov et al., 2019b). We believe that noisy analog hardware with modified learning algorithms can also accelerate Bayesian approaches while simultaneously providing many known benefits, such as improved generalization and uncertainty calibration. However, these ideas require further investigation, and new techniques that can also extract the weight uncertainty from analog hardware are needed. Furthermore, extending this work to larger and more extensive networks is a general task for the feasibility of analog crossbar arrays, not restricted to the work presented here.

SUMMARY

In summary, we presented a new DNN training algorithm, TTv2, that provides successful training on extremely noisy analog hardware composed of resistive crossbar arrays. Compared to previous solutions, TTv2 addresses all sorts of hardware non-idealities coming from resistive devices and peripheral circuits and provides orders of magnitude relaxation to many hardware specs. Device arrays with non-symmetric and noisy conductance modulation characteristics and a limited number of states are enough for TTv2 to train neural networks close to their ideal accuracy. In addition, the model averaging technique applied over TTv2 iterates provides further enhancements during the model extraction. In short, we describe an end-to-end training algorithm and model extraction technique for extremely noisy crossbar-based analog hardware that matches the performance of full-precision SGD training. Our techniques can be immediately realized and applied to many readily available device technologies that can be utilized for analog deep learning accelerators.

DATA AVAILABILITY STATEMENT

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS

TG conceived the original idea, developed the methodology, wrote the simulation code, analyzed and interpreted the results, and drafted the manuscript.

ACKNOWLEDGMENTS

The author thanks Wilfried Haensch for illuminating discussions and Paul Solomon for careful reading of the manuscript.
REFERENCES

Agarwal, S., Plimpton, S. J., Hughart, D. R., Hsia, A. H., Richter, I., Cox, J. A., et al. (2016). Resistive Memory Device Requirements for a Neural Network Accelerator. Vancouver, BC, Canada: IJCNN. doi:10.1109/IJCNN.2016.7727298

Agarwal, S., Gedrim, R. B. J., Hsia, A. H., Hughart, D. R., Fuller, E. J., Talin, A. A., James, C. D., Plimpton, S. J., and Marinella, M. J. (2017). "Achieving Ideal Accuracies in Analog Neuromorphic Computing Using Periodic Carry," in Symposium on VLSI Technology, Kyoto, Japan. doi:10.23919/vlsit.2017.7998164

Ambrogio, S., Narayanan, P., Tsai, H., Shelby, R. M., Boybat, I., di Nolfo, C., et al. (2018). Equivalent-accuracy Accelerated Neural-Network Training Using Analogue Memory. Nature 558, 60–67. doi:10.1038/s41586-018-0180-5

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). "Weight Uncertainty in Neural Networks," in Proceedings of the 32nd International Conference on Machine Learning, PMLR 37, 1613–1622.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). "Language Models Are Few-Shot Learners," arXiv:2005.14165 [cs.CL].

Burr, G. W., Narayanan, P., Shelby, R. M., Sidler, S., Boybat, I., di Nolfo, C., and Leblebici, Y. (2015). "Large-scale Neural Networks Implemented with Non-volatile Memory as the Synaptic Weight Element: Comparative Performance Analysis (Accuracy, Speed, and Power)," in IEDM (International Electron Devices Meeting). doi:10.1109/iedm.2015.7409625

Burr, G. W., Shelby, R. M., Sebastian, A., Kim, S., Kim, S., Sidler, S., et al. (2017). Neuromorphic Computing Using Non-volatile Memory. Adv. Phys. X 2, 89–124. doi:10.1080/23746149.2016.1259585

Cloud TPU (2007). Available at: https://cloud.google.com/tpu/docs/bfloat16.

Fuller, E. J., Keene, S. T., Melianas, A., Wang, Z., Agarwal, S., Li, Y., et al. (2019). Parallel Programming of an Ionic Floating-Gate Memory Array for Scalable Neuromorphic Computing. Science 364 (6440), 570–574. doi:10.1126/science.aaw5581

Gokmen, T., and Haensch, W. (2020). Algorithm for Training Neural Networks on Resistive Device Arrays. Front. Neurosci. 14, 103. doi:10.3389/fnins.2020.00103

Gokmen, T., Onen, M., and Haensch, W. (2017). Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices. Front. Neurosci. 11, 538. doi:10.3389/fnins.2017.00538

Gokmen, T., Rasch, M. J., and Haensch, W. (2018). Training LSTM Networks with Resistive Cross-Point Devices. Front. Neurosci. 12, 745. doi:10.3389/fnins.2018.00745

Gokmen, T., and Vlasov, Y. (2016). Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices: Design Considerations. Front. Neurosci. 10, 333. doi:10.3389/fnins.2016.00333

Graphcore (2021). Available at: https://www.graphcore.ai/.

Grollier, J., Querlioz, D., Camsari, K. Y., Everschor-Sitte, K., Fukami, S., and Stiles, M. D. (2020). Neuromorphic Spintronics. Nat. Electron. 3, 360–370. doi:10.1038/s41928-019-0360-9

Haensch, W., Gokmen, T., and Puri, R. (2019). The Next Generation of Deep Learning Hardware: Analog Computing. Proc. IEEE 107, 108–122. doi:10.1109/jproc.2018.2871057

He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in IEEE International Conference on Computer Vision (ICCV). doi:10.1109/iccv.2015.123

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2019a). "Averaging Weights Leads to Wider Optima and Better Generalization," arXiv:1803.05407 [cs.LG].

Izmailov, P., Maddox, W., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. (2019b). "Subspace Inference for Bayesian Deep Learning," in Uncertainty in Artificial Intelligence (UAI) 115, 1169–1179.

Joshi, V., Le Gallo, M., Haefeli, S., Boybat, I., Nandakumar, S. R., Piveteau, C., et al. (2020). Accurate Deep Neural Network Inference Using Computational Phase-Change Memory. Nat. Commun. 11, 2473. doi:10.1038/s41467-020-16108-9

Kim, H., Rasch, M., Gokmen, T., Ando, T., Miyazoe, H., Kim, J.-J., et al. (2019). "Zero-shifting Technique for Deep Neural Network Training on Resistive Cross-point Arrays," arXiv:1907.10228 [cs.ET].

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep Learning. Nature 521, 436–444. doi:10.1038/nature14539

Mackin, C., Narayanan, P., Ambrogio, S., Tsai, H., Spoon, K., Fasoli, A., Chen, A., Friz, A., Shelby, R. M., and Burr, G. W. (2020). "Neuromorphic Computing with Phase Change, Device Reliability, and Variability Challenges," in IEEE International Reliability Physics Symposium (IRPS), Dallas, TX, USA. doi:10.1109/irps45951.2020.9128315

Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). "A Simple Baseline for Bayesian Uncertainty in Deep Learning," in Advances in Neural Information Processing Systems (Vancouver, BC, Canada: NeurIPS) 32, 13153–13164.

Miyashita, D., Lee, E. H., and Murmann, B. (2016). "Convolutional Neural Networks Using Logarithmic Data Representation," arXiv:1603.01025 [cs.NE].

Nandakumar, S. R., Le Gallo, M., Piveteau, C., Joshi, V., Mariani, G., Boybat, I., et al. (2020). Mixed-Precision Deep Learning Based on Computational Memory. Front. Neurosci. 14, 406. doi:10.3389/fnins.2020.00406

Nvidia (2021). Available at: https://www.nvidia.com/en-us/data-center/a100/.

Onen, M., Gokmen, T., Todorov, T. K., Nowicki, T., Alamo, J. A. D., Rozen, J., et al. (2021). Neural Network Training with Asymmetric Crosspoint Elements. Submitted for publication.

Rasch, M. J., Gokmen, T., and Haensch, W. (2020). Training Large-Scale Artificial Neural Networks on Simulated Resistive Crossbar Arrays. IEEE Des. Test 37 (2), 19–29. doi:10.1109/mdat.2019.2952341

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature 323, 533–536. doi:10.1038/323533a0

Steinbuch, K. (1961). Die Lernmatrix. Kybernetik 1, 36–45.

Strubell, E., Ganesh, A., and McCallum, A. (2019). "Energy and Policy Considerations for Deep Learning in NLP," in ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 3645–3650.

Sun, X., Choi, J., Chen, C.-Y., Wang, N., Venkataramani, S., Srinivasan, V., et al. (2019). Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks. Adv. Neural Inf. Process. Syst. 32, 4901–4910.

Sun, X., Wang, N., Chen, C.-Y., Ni, J.-M., Agrawal, A., Cui, X., et al. (2020). Ultra-Low Precision 4-bit Training of Deep Neural Networks. Adv. Neural Inf. Process. Syst. 33, 1796–1807.

Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 105, 2295–2329. doi:10.1109/jproc.2017.2761740

Wilson, A. G. (2020). Bayesian Deep Learning and a Probabilistic Perspective of Model Construction. International Conference on Machine Learning Tutorial.

Woo, J., and Yu, S. (2018). Resistive Memory-Based Analog Synapse: The Pursuit for Linear and Symmetric Weight Update. IEEE Nanotechnology Mag. 12, 36–44. doi:10.1109/mnano.2018.2844902

Yang, G., Zhang, T., Kirichenko, P., Bai, J., Wilson, A. G., and Sa, C. D. (2019). "SWALP: Stochastic Weight Averaging in Low-Precision Training," arXiv:1904.11943 [cs.LG].

Yu, S., Chen, P., Cao, Y., Xia, L., Wang, Y., and Wu, H. (2015). "Scaling-up Resistive Synaptic Arrays for Neuro-Inspired Architecture: Challenges and Prospect," in International Electron Devices Meeting (IEDM), Washington, DC, USA (IEEE). doi:10.1109/iedm.2015.7409718

Yu, S. (2018). Neuro-inspired Computing with Emerging Nonvolatile Memorys. Proc. IEEE 106, 260–285. doi:10.1109/jproc.2018.2790840

Conflict of Interest: TG was employed by the company IBM. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2021 Gokmen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.