Enabling Training of Neural Networks on Noisy Hardware

Tayfun Gokmen
Deep neural networks (DNNs) are typically trained using the conventional stochastic gradient descent (SGD) algorithm. However, SGD performs poorly when applied to train networks on non-ideal analog hardware composed of resistive device arrays with non-symmetric conductance modulation characteristics. Recently we proposed a new algorithm, the Tiki-Taka algorithm, that overcomes this stringent symmetry requirement. Here we build on top of Tiki-Taka and describe a more robust algorithm that further relaxes other stringent hardware requirements. This more robust second version of the Tiki-Taka algorithm (referred to as TTv2) 1) reduces the required number of device conductance states from 1,000s of states to only 10s of states, 2) increases the tolerance to noise in the device conductance modulations by about 100×, and 3) increases the tolerance to noise in the matrix-vector multiplications performed by the analog arrays by about 10×. Empirical simulation results show that TTv2 can train various neural networks close to their ideal accuracy even at extremely noisy hardware settings. TTv2 achieves these capabilities by complementing the original Tiki-Taka algorithm with lightweight, low-computational-complexity digital filtering operations performed outside the analog arrays. Therefore, the implementation cost of TTv2 compared to SGD and Tiki-Taka is minimal, and it maintains the usual power and speed benefits of using analog hardware for training workloads. Here we also show how to extract the neural network from the analog hardware once the training is complete for further model deployment. Similar to Bayesian model averaging, we form analog-hardware-compatible averages over the neural network weights derived from TTv2 iterates. This model average can then be transferred to another analog or digital hardware with notable improvements in test accuracy, transcending the trained model itself. In short, we describe an end-to-end training and model extraction technique for extremely noisy crossbar-based analog hardware that can be used to accelerate DNN training workloads and match the performance of full-precision SGD.

Keywords: learning algorithms, training algorithms, neural network acceleration, Bayesian neural network, in-memory computing, on-chip learning, crossbar arrays, memristor
INTRODUCTION
Deep neural networks (DNNs) (LeCun et al., 2015) have achieved tremendous success in multiple domains, outperforming other approaches and even humans (He et al., 2015) at many problems: object recognition, video analysis, and natural language processing are only a few to mention. However, this success was enabled mainly by scaling the DNNs and datasets to extreme sizes, and therefore it came at the expense of needing immense computation power and time. For instance, the amount of compute required to train a single GPT-3 model composed of 175B parameters is tremendous: 3,600 Petaflops/s-days (Brown et al., 2020), equivalent to running 1,000 state-of-the-art NVIDIA A100 GPUs, each delivering 150 Teraflops/s of performance, for about 24 days. Hence, today's and tomorrow's large models are costly to train both financially and environmentally on currently available hardware (Strubell et al., 2019), begging for faster and more energy-efficient solutions.

DNNs are typically trained using the conventional stochastic gradient descent (SGD) and backpropagation (BP) algorithm (Rumelhart et al., 1986). During DNN training, matrix-matrix multiplications, and hence repeated multiply and add operations, dominate the total workload. Therefore, regardless of the underlying technology, realizing highly optimized multiply and add units and sustaining many of these units with appropriate data paths is practically the only game everybody plays while proposing new hardware for DNN training workloads (Sze et al., 2017).

One approach that has been quite successful in the past few years is to design highly optimized digital circuits using conventional CMOS technology that leverages reduced-precision arithmetic for the multiply and add operations. These techniques are already employed to some extent by current GPUs (Nvidia, 2021) and other application-specific integrated circuit (ASIC) designs, such as TPUs (Cloud TPU, 2007) and IPUs (Graphcore, 2021). There are also many research efforts extending the boundaries of reduced-precision training, using hybrid 8-bit (Sun et al., 2019) and 4-bit (Sun et al., 2020) floating-point and 5-bit logarithmically scaled (Miyashita et al., 2016) number formats.

As an alternative to digital CMOS, hardware architectures composed of novel resistive cross-point device arrays have been proposed that can deliver significant power and speed benefits for DNN training (Gokmen and Vlasov, 2016; Haensch et al., 2019; Burr et al., 2017; Burr et al., 2015; Yu, 2018). We refer to these cross-point devices as resistive processing unit [RPU (Gokmen and Vlasov, 2016)] devices, as they can perform all the multiply and add operations needed for training by relying on physics. Out of all multiply and add operations during training, 1/3 are performed during forward propagation, 1/3 are performed during error backpropagation, and finally, 1/3 are performed during gradient computation. RPU devices use Ohm's law and Kirchhoff's law (Steinbuch, 1961) to perform the multiply and add needed for the forward propagation and error backpropagation. More importantly, however, RPUs use the device conductance modulation and memory characteristics to perform the multiply and add needed during the gradient computation (Gokmen and Vlasov, 2016).

Unfortunately, RPU-based crossbar architectures have had only minimal success so far. That is mainly because the training accuracy on this imminent analog hardware strongly depends on the cross-point elements' conductance modulation characteristics when the conventional SGD algorithm is used. One of the key requirements is that these devices must symmetrically change conductance when subjected to positive or negative pulse stimuli (Gokmen and Vlasov, 2016; Agarwal et al., 2016). Theoretically, it is shown that only symmetric devices provide the unbiased gradient calculation and accumulation needed for the SGD algorithm, whereas any non-symmetric device characteristic modifies the optimization objective and hampers the convergence of SGD-based training (Gokmen and Haensch, 2020; Onen et al., 2021).

Many different solutions have been proposed to tackle SGD's convergence problem on crossbar arrays. First, widespread efforts have been made to engineer resistive devices with symmetric modulation characteristics (Fuller et al., 2019; Woo and Yu, 2018; Grollier et al., 2020), but a mature device technology with the desired behavior remains to be seen. Second, many high-level mitigation techniques have been proposed to overcome the device asymmetry problem. One critical issue with these techniques is the serial access to cross-point elements, either one-by-one or row-by-row (Ambrogio et al., 2018; Agarwal et al., 2017; Yu et al., 2015). Serial operations such as reading conductance values individually, engineering update pulses to force symmetric modulation artificially, and carrying or resetting weights periodically come with a tremendous overhead for large networks. Alternatively, there are approaches that perform the gradient computation outside the arrays using digital processing (Nandakumar et al., 2020). Note that, irrespective of the DNN architecture, 1/3 of the whole training workload is in the gradient computation. For instance, for the GPT-3 network, 1,200 Petaflops/s-days are required solely for gradient computation throughout the training. Consequently, these approaches cannot deliver much more performance than the fully digital reduced-precision alternatives mentioned above. In short, there exist solutions possibly addressing the convergence issue of SGD on non-symmetric device arrays. However, they all defeat the purpose of performing the multiply and add operations on the RPU device and lose the performance benefits.

In contrast to all previous approaches, we recently proposed a new training algorithm, the so-called Tiki-Taka algorithm (Gokmen and Haensch, 2020), that performs all three cycles (forward propagation, error backpropagation, and gradient computation) on the RPU arrays using the physics and converges with non-symmetric device arrays. Tiki-Taka works very differently from SGD, and we showed in another study that non-symmetric device behavior plays a useful role in the convergence of Tiki-Taka (Onen et al., 2021).

Here, we build on top of Tiki-Taka and present a more robust second version that relaxes other stringent hardware issues by orders of magnitude, namely the limited number of states of RPU devices and noise. We refer to this more robust second version of the Tiki-Taka algorithm as TTv2 for the rest of the paper. In the first part of the paper, we focus on training, present the TTv2 algorithm details, and provide simulation results at various hardware settings. We tested TTv2 on various network architectures, including fully connected, convolutional, and LSTM networks, although the presented results focus on the more challenging LSTM network. TTv2 shows significant improvements in training accuracy compared to Tiki-Taka, even at much more challenging hardware settings. In the second part of the paper, we show an analog-hardware-friendly technique to extract the trained model from the noisy hardware. We also generalize this technique, apply it over TTv2 iterates, and extract each weight's time average from a particular training period. These weight averages provide a model that approximates the Bayesian model average, and it outperforms the trained model itself. With this new training algorithm and accurate model extraction technique, we show that noisy analog hardware composed of RPU device arrays can provide scalable training solutions that match the performance of full-precision SGD.
PART I: Training

In this section, we first give an overview of the device arrays and device update characteristics used for training. Then we present a brief background on Tiki-Taka. Finally, we detail TTv2 and provide comprehensive simulation results on an LSTM network at various hardware settings.

Device Arrays and Conductance Modulation Characteristics

A resistive crossbar array of devices performs efficient matrix-vector multiplication (y = Wx) using Ohm's law and Kirchhoff's law. The device array's stored conductance values form a matrix (W), whereas the input vector (x) is transmitted as voltage pulses through the columns, and the resulting vector (y) is read as current signals from the rows. However, only positive conductance values are allowed physically. Therefore, to encode both positive and negative matrix elements, a pair of devices is operated in differential mode. With the help of the peripheral circuits supplying the voltage inputs and reading out the differential current signals, logical matrix elements (wij) are mapped to physical conductance pairs as

wij = Κ(gij − gij,ref)   (1)

where Κ is a global gain factor controlled by the periphery, and gij and gij,ref are the conductance values stored at each pair corresponding to the ith row and jth column. Moreover, crossbar arrays can easily be operated in the transpose mode by changing the periphery's input and output directions. As a result, a pair of arrays with the supporting peripheral circuits provides a logical matrix (also referred to as a single tile) that any algorithm can utilize to perform a series of matrix-vector multiplications (mat-vecs) using W and W^T.
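As a concrete illustration of this tile abstraction, the short sketch below models the differential-pair mapping of Eq. 1 and a noisy mat-vec in both normal and transpose modes. It is only an illustrative stand-in for the simulator used in this work; the class name, array size, and noise level are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

class AnalogTile:
    """Toy model of a single tile: w_ij = K * (g_ij - g_ij,ref), as in Eq. 1."""

    def __init__(self, n_rows, n_cols, gain_k=1.0, mv_noise=0.06):
        self.gain_k = gain_k        # global gain factor K set by the periphery
        self.mv_noise = mv_noise    # additive read-out noise of the analog mat-vec
        # only positive conductances are physical; both arrays start at a mid level
        self.g = np.full((n_rows, n_cols), 0.5)
        self.g_ref = np.full((n_rows, n_cols), 0.5)

    @property
    def weights(self):
        # signed logical weights realized by the differential pair (Eq. 1)
        return self.gain_k * (self.g - self.g_ref)

    def matvec(self, x, transpose=False):
        """Noisy y = W x, or y = W^T x when the periphery runs in transpose mode."""
        w = self.weights.T if transpose else self.weights
        y = w @ x
        return y + self.mv_noise * rng.standard_normal(y.shape)

# forward and backward (transpose) passes reuse the same stored conductances
tile = AnalogTile(4, 8)
y = tile.matvec(rng.uniform(-1, 1, size=8))                    # uses W, shape (4,)
err = tile.matvec(rng.uniform(-1, 1, size=4), transpose=True)  # uses W^T, shape (8,)
print(y.shape, err.shape)
```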
For training algorithms, the efficient update of the stored matrix elements is also an essential component. Therefore, device conductance modulation and memory characteristics are utilized to implement a local and parallel update on RPU arrays. During the update cycle, input signals are encoded as a series of voltage pulses and simultaneously supplied to the array's rows and columns. Note that the voltage pulses are applied only to the first set of RPU devices, and the reference devices are kept constant. As a result of voltage pulse coincidence, the corresponding RPU device changes its conductance by a small amount bi-directionally, depending on the voltage polarity. This incremental change in device conductance results in an incremental change in the stored weight value, and the RPU response is governed by Eq. 2.

wij ← wij ∓ Δwmin,ij Fij(wij) − Δwmin,ij Gij(wij)   (2)

In Eq. 2, the ∓ sign is decided by the external voltage pulse polarity, Δwmin,ij is the incremental weight change due to a single pulse coincidence, and Fij(wij) and Gij(wij) are the symmetric (additive) and antisymmetric (subtractive) combinations of the positive and negative conductance modulation characteristics (Gokmen and Haensch, 2020), all of which are properties of the updated device corresponding to the ith row and jth column. Eq. 2 is very general and governs the computation (hardware-induced update) performed by the tile for all sorts of RPU device behaviors, assuming the device conductance modulation characteristics are some function of the device conductance state. If the conductance modulations are much smaller than the whole conductance range of operation, Eq. 3 can be derived from Eq. 2.

wij ← wij + η[δi × xj] Fij(wij) − η|δi × xj| Gij(wij)   (3)

In Eq. 3, xj and δi represent the input values used for updates for each column and row, respectively, corresponding to the activations and errors calculated in the forward and backward cycles, and η is a scalar controlling the strength of the update, all of which are inputs to the pulse generation circuitry at the periphery. Here, we use the stochastic pulsing scheme proposed in Ref Gokmen and Vlasov (2016), and during the parallel update, the number of pulses generated by the periphery is bounded by npulse = η max(|δi|) max(|xj|)/μΔw, where μΔw is the mean of Δwmin,ij for the whole tile. Using npulse, stochastic translators generate pulses with the correct probability; therefore, Eq. 3 is valid in expectation, whereas in the limit of a single pulse coincidence, the RPU response is governed by Eq. 2.

Figure 1A illustrates the pulse response of a linear and symmetric device, where F(w) = 1 and G(w) = 0, and the hardware-induced update rule simplifies to the SGD update rule of wij ← wij + η[δi × xj]. In the literature, this kind of device behavior is usually referred to as the "ideal" device required for SGD. For a non-linear but symmetric device, F(w) deviates from unity and becomes a function of w, but G(w) remains zero. For non-symmetric devices, G(w) also deviates from zero and becomes a function of w, hence differing from the form required by SGD. Figure 1B illustrates an exponentially saturating non-symmetric device, where wij ← wij + η[δi × xj] − η|δi × xj| wij describes the computation performed by this device. Although this form of update behavior causes convergence issues for SGD, Tiki-Taka trains DNNs successfully with all sorts of non-symmetric devices (Gokmen and Haensch, 2020). Therefore, in contrast to SGD, all sorts of non-symmetric device behaviors can be considered "ideal" for Tiki-Taka.

Tiki-Taka's training performance depends on the successful application of the symmetry point shifting technique (Kim et al., 2019), which guarantees G(w = 0) = 0 for all elements in the tile. This behavior is illustrated for the device in Figure 1B, where the strengths of the positive and negative weight increments are equal at the symmetry point (w = 0).
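To make the pulsed update concrete, the sketch below applies Eq. 2 coincidence by coincidence for a single device and recovers Eq. 3 in expectation. It is a hedged illustration rather than the paper's simulator: the pulse probabilities, learning rate, and the choice F(w) = 1, G(w) = w (the saturating device of Figure 1B) are assumptions used only to demonstrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)

def pulsed_update(w, x, delta, dw_min=0.001, eta=0.1,
                  F=lambda w: 1.0, G=lambda w: w):
    """One parallel update of a single device following Eq. 2/3.

    Each stochastic pulse coincides with a probability chosen so that the
    expected total change equals eta*(x*delta)*F(w) - eta*|x*delta|*G(w) (Eq. 3),
    while every individual coincidence applies Eq. 2.
    """
    n_pulse = max(int(np.ceil(eta * abs(x * delta) / dw_min)), 1)
    sign = np.sign(x * delta)                  # pulse polarity set by the periphery
    p = min(eta * abs(x * delta) / (n_pulse * dw_min), 1.0)
    for _ in range(n_pulse):
        if rng.random() < p:                   # stochastic translator coincidence
            w = w + sign * dw_min * F(w) - dw_min * G(w)   # Eq. 2
    return w

# F(w)=1, G(w)=w models the exponentially saturating device of Figure 1B;
# with zero-mean inputs the weight decays toward the symmetry point w = 0.
w = 0.4
for _ in range(500):
    w = pulsed_update(w, x=rng.choice([-0.5, 0.5]), delta=0.3)
print(round(w, 3))   # fluctuates around 0
```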
Using the generated u vector and the result of f(v), C is updated by employing the hardware-induced parallel update (line 15). f(v) is a pointwise function:

f(vi) = vi if |vi| ≥ T, and 0 otherwise,

where T is set to the mat-vec noise. ηc is the learning rate used for updating C. These operations are repeated for the data examples in the training dataset for multiple epochs until a convergence criterion is met. Following the same practices described in Ref Gokmen and Haensch (2020), here we also use the one-hot encoded u vectors and the thresholding f(v) for the LSTM simulations.
TTv2 Algorithm

Algorithm 2 outlines the details of the TTv2 algorithm. In addition to the A and C matrices allocated on analog arrays, TTv2 also allocates another matrix H in the digital domain. This matrix H is used to implement a low-pass filter while transferring the gradient information processed by A to C. In contrast to Tiki-Taka, TTv2 uses only the C matrix to encode the neural network's parameters, corresponding to c = 0. Therefore, the activation (x) and error (δ) computations are performed using C (lines 10 and 11). TTv2 does not change the updates performed on A. After ns updates, a mat-vec is performed on A. Unlike Tiki-Taka, TTv2 only uses a one-hot encoded u vector while performing a mat-vec on A. This provides a noisy estimate of a single row of A, and the results are stored in v. After this step, the significant distinction between Tiki-Taka and TTv2 appears. Instead of using u and v to update C, TTv2 first accumulates v (after scaling with ηc) on H's corresponding row, referred to as H(row t). During this digital vector-vector addition, the magnitude of any element in H(row t) may exceed unity. In that case, the corresponding elements are reset back to zero, and a single-pulse parallel update on C is performed. The C update of TTv2 uses the sign information of the elements that grew in amplitude beyond one and the row information t. After these steps, TTv2 loops back and repeats these operations for other data examples until it reaches convergence.
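The data flow just described can be paraphrased in a few lines of plain Python, as sketched below. This is not the Algorithm 2 listing itself but a hedged illustration of the A → H → C transfer: the array sizes, learning rates, threshold of one, and the synthetic "error" signal are all assumptions, and the analog arrays are idealized as floating-point matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_out, n_in = 8, 16
n_s, eta, eta_c, dw_min = 5, 0.1, 0.05, 0.01   # placeholder hyper-parameters

A = np.zeros((n_out, n_in))                # gradient-processing matrix (analog, idealized)
C = rng.uniform(-0.1, 0.1, (n_out, n_in))  # network weights (analog)
H = np.zeros((n_out, n_in))                # digital low-pass filter matrix

def noisy_matvec(M, v, sigma=0.06):
    return M @ v + sigma * rng.standard_normal(M.shape[0])

row = 0                                    # row counter t, advanced once per transfer
for step in range(1, 201):
    x = rng.uniform(-1, 1, n_in)           # activations (forward pass would use C)
    delta = -noisy_matvec(C, x)            # stand-in for the backpropagated error
    A += eta * np.outer(delta, x)          # parallel (rank-one) update of A

    if step % n_s == 0:                    # after n_s updates: transfer through H
        u = np.zeros(n_out)
        u[row] = 1.0                       # one-hot read vector
        v = noisy_matvec(A.T, u)           # noisy estimate of row `row` of A
        H[row] += eta_c * v                # digital vector-vector accumulation
        fired = np.abs(H[row]) >= 1.0      # elements with enough accumulated evidence
        C[row, fired] += dw_min * np.sign(H[row, fired])   # single-pulse update on C
        H[row, fired] = 0.0                # reset only the transferred elements
        row = (row + 1) % n_out

print(float(np.abs(C).mean()))
```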
Training Simulations

We performed training simulations for fully connected, convolutional, and LSTM networks: the same three networks and datasets studied in Ref Gokmen and Haensch (2020).

Array Model

We use a device model like the one presented in Figure 1B but with significant array-level variability and noise for the training simulations. We simulate stochastic translators at the periphery during the update, and each coincidence event triggers an incremental weight change on the corresponding RPU as described below. We also introduce noise and signal bounds during the matrix-vector multiplications performed on the arrays.

During the update, the weight increments (Δw+ij) and decrements (Δw−ij) are assumed to be functions of the current weight value. For the positive branch Δw+ij = Δwmin,ij (1 − slope+ij × wij), and for the negative branch Δw−ij = Δwmin,ij (1 + slope−ij × wij), where slope+ij and slope−ij are the slopes that control the dependence of the weight changes on the current weight values, and Δwmin,ij is the weight change due to a single coincidence event at the symmetry point. This model results in three unique parameters for each RPU element. All these parameters are sampled independently using a unit Gaussian random variable and then used throughout the training, providing device-to-device variability. The slopes are obtained using slope+ij = μs(1 + σs ξ+ij) and slope−ij = μs(1 + σs ξ−ij), where μs = 1.66, σs is set to 0.1, 0.2, or 0.3 for different experiments, and ξ are the independent random samples. The simulation results were insensitive to σs; therefore, we only show the results corresponding to σs = 0.2. The weight increments at the symmetry point are obtained using Δwmin,ij = μΔw(1 + σΔw ξij), where σΔw = 0.3 and μΔw is the array average, varied from 0.6 × 10−4 up to 0.15 for different experiments to study the effect of the number of states on training accuracy. We define the number of states as the ratio of the nominal weight range to the nominal weight increment at the symmetry point; therefore, 2/(μs μΔw) provides the average number of states. Note that this definition of the number of states is very different from the definition used for devices developed for memory applications, and it should not be compared against multi-bit storage elements. Besides, additional Gaussian noise is introduced to each weight increment and decrement to capture the cycle-to-cycle noise: for the multiplicative noise model Δw∓ij → Δw∓ij (1 + σcycle ξ), whereas for the additive noise model Δw∓ij → Δw∓ij + Δwmin,ij σcycle ξ, where σcycle is set to 0.3 or 1 for different experiments, and ξ is sampled from a unit Gaussian for each coincidence event.
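For reference, this array model can be written down directly as sampled per-device parameters; the sketch below does so using the statistics quoted above (μs = 1.66, σs = 0.2, σΔw = 0.3, additive cycle-to-cycle noise with σcycle = 1). The tile size and the way the coincidences are driven are illustrative assumptions, not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(3)
shape = (64, 64)                      # tile size (illustrative)

mu_s, sigma_s = 1.66, 0.2             # slope statistics quoted above
mu_dw, sigma_dw = 0.08, 0.3           # array-average weight increment (~15 states)
sigma_cycle = 1.0                     # cycle-to-cycle noise, additive model

# three unique parameters per RPU device, sampled once and fixed for training
slope_plus = mu_s * (1 + sigma_s * rng.standard_normal(shape))
slope_minus = mu_s * (1 + sigma_s * rng.standard_normal(shape))
dw_min = mu_dw * (1 + sigma_dw * rng.standard_normal(shape))

print("average number of states ~", round(2 / (mu_s * mu_dw)))   # 2/(mu_s*mu_dw) ~ 15

def single_coincidence(w, up):
    """Per-device weight change for one pulse coincidence, with additive cycle noise."""
    if up:
        dw = dw_min * (1 - slope_plus * w)       # increment branch
    else:
        dw = -dw_min * (1 + slope_minus * w)     # decrement branch
    dw += dw_min * sigma_cycle * rng.standard_normal(shape)
    return w + dw

w = np.zeros(shape)
w = single_coincidence(w, up=True)
w = single_coincidence(w, up=False)
print(float(np.abs(w).mean()))
```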
FIGURE 3 | LSTM training simulations for SGD, Tiki-Taka, and TTv2 algorithms. Different color curves use an array model with non-symmetric devices, μΔw = 0.001 (corresponding to 1,200 states), and the multiplicative cycle-to-cycle update noise at σcycle = 0.3. The square symbols show the SGD training using linear and symmetric devices where all devices' slope parameters are set to zero while all other array parameters remain unchanged. The open circles are the floating-point baseline.

However, the presented results focus on the most challenging LSTM network, referred to as LSTM2-64-WP in Ref Gokmen et al. (2018). This network is composed of two stacked LSTM blocks, each with a hidden state size of 64. Leo Tolstoy's War and Peace (WP) novel is used as a dataset, and it is split into training and test sets of 2,933,246 and 325,000 characters with a total vocabulary of 87 characters. This task trains a character-based language model where the input to the network is a sequence of characters from the WP novel, and the network is trained with the cross-entropy loss function to predict the next character in the sequence. LSTM2-64-WP has three different weight matrices for SGD, and including the

FIGURE 5 | (A–C) Blue curves: the evolution of three different weights (corresponding to three different devices with non-symmetric behavior, σcycle = 1 and about 15 states) during TTv2 training. Red curves show the sign of the updates and the expected average saturation value for the corresponding device. (D) The evolution of a linear and symmetric device with σcycle = 0.3 and more than 1,000 states. (E) LSTM training simulations for SGD, Tiki-Taka, and TTv2 algorithms. Different color curves use an extremely noisy array model with non-symmetric devices, μΔw = 0.08 (corresponding to 15 states), and the additive cycle-to-cycle update noise with σcycle = 1. The square symbols show the SGD training baseline from Figure 3 with symmetric device arrays with 1,200 states and cycle-to-cycle noise at σcycle = 0.3. The open circles are the floating-point baseline.
weight opposite to the intended direction, as illustrated in Figures 5A–C. Furthermore, same-sign updates are encouraged to accelerate the learning along the dimensions that have higher confidence.

Finally, we emphasize that, in contrast to SGD and Tiki-Taka, TTv2 only fails gracefully at these extremely challenging hardware settings. We note that continued training further improves the performance of TTv2 until 200 epochs, and a test error of 1.57 is achieved for the modified TTv2. This test error is almost identical to the one achieved by the symmetric-device SGD baseline with 1,200 states and many orders of magnitude less noise. All these results show that TTv2 is superior to Tiki-Taka and SGD, especially when the analog hardware becomes noisy and provides a very limited number of states on RPU devices.

Implementation Cost of TTv2

The true benefit of using device arrays for training workloads emerges when the required gradient computation (and processing) step is performed in the array using the RPU device properties. As mentioned in the introduction, the gradient computation is 1/3 of the training operations performed on the weights that the hardware must handle efficiently. Irrespective of the layer type, such as convolution, fully connected, or LSTM, for an n × n weight matrix in a neural network, each gradient processing step per weight reuse has a computational complexity of O(n²). RPU arrays perform the required gradient processing step efficiently in O(1) constant time using array parallelism. Specifically, analog arrays deliver O(1) time complexity simply because the array has O(n²) compute resources (RPU devices). In this scheme, each computation is mapped to a resource, and consequently, RPU arrays trade space complexity for time complexity, whereas computational complexity remains unchanged. As a result of this spatial mapping, crossbar-based analog accelerators require a multi-tile architecture design irrespective of the training algorithm, so that each neural network layer and the corresponding weights can be allocated on separate tiles. Nevertheless, RPU arrays provide a scalable solution for a spatially mapped, weight-stationary architecture for training workloads thanks to the nano-scale device concepts.

As highlighted in Algorithm 2, TTv2 uses the same tile operations, and therefore running TTv2 on array architectures requires no change in the tile design compared to SGD or Tiki-Taka. Assuming the tile design remains unchanged (a pair of device arrays operated differentially with the supporting peripheral circuits), TTv2 (like Tiki-Taka) requires twice as many tiles to allocate A and C separately. Alternatively, however, the logical A and C values can be realized using only three devices by sharing a common reference, as described in Ref Onen et al. (2021). In that case, the logical A and C matrices can be absorbed into a single tile design composed of three device arrays and operated in a time-multiplexed fashion. This tile design minimizes or even possibly eliminates the area cost of TTv2 and Tiki-Taka compared to SGD.

In contrast to the A and C matrices allocated on analog arrays, H does not require any spatial mapping as it is allocated digitally, and it can reside on an off-chip memory. Furthermore, we emphasize that the digital H processing of TTv2 must not be confused with the gradient computation step. For an n × n weight matrix in a neural network, the computational complexity of the operations performed on H is only O(n), even for the most aggressive setting of ns = 1. As detailed in Algorithm 2, only a single row of H is accessed and processed digitally for ns parallel array update operations on A. Therefore, H processing has reduced computational complexity compared to gradient computation: O(n) vs. O(n²). This property differentiates TTv2 from other approaches performing the gradient computation in the digital domain with O(n²) complexity (Nandakumar et al., 2020). Regardless, the digital H processing in TTv2 brings additional digital computation and memory bandwidth requirements compared to SGD or Tiki-Taka. To understand the extra burden introduced by H in TTv2, we must compare it to the burden already handled by the digital components for the SGD algorithm. We argue that the extra burden introduced in TTv2 is usually only on the order of 1/ns, and the digital components required by the SGD algorithm can also handle the H processing of TTv2.

A weight reuse factor (ws) for each layer in a neural network is determined by various factors, such as time-unrolling steps in an LSTM, reuse of filters for different image portions in a convolution, or simply the use of mini-batches during training. For an n × n weight matrix with a weight reuse factor of ws, the compute performed on the analog array is O(n²·ws). In contrast, the storage and processing performed digitally for the activations and error backpropagation are usually O(n·ws). We emphasize that these O(n·ws) compute and storage requirements are common to TTv2, Tiki-Taka, and SGD and are already addressed by digital components.

The digital filter of TTv2 computes straightforward vector-vector additions and thresholds, which require O(n) operations performed only after ns weight reuses. As mentioned above, SGD (likewise Tiki-Taka and TTv2) uses digital units to compute the activations and the error signals, both of which are usually O(n·ws). Therefore, the digital compute needed for the H processing of TTv2 increases the total digital compute by O(n·ws/ns).

Additionally, the filter requires the H matrix to be stored digitally. H is as large as the neural network model and requires off-chip memory storage and access. One may argue that this defeats the purpose of using analog crossbar arrays. However, note that even though the storage requirement for H is O(n²), the access to H happens one row at a time, which is O(n). Therefore, as long as the memory bandwidth can sustain access to H, the storage requirement is a secondary concern that can easily be addressed by allocating space on external off-chip memory. This increases the required storage capacity from O(n·ws) (only for activations) to O(n·ws) + O(n²) (activations + H).

Finally, assuming H resides on an off-chip memory, the hardware architecture must provide enough memory bandwidth to access H. As noted in Algorithm 2, access to H is very regular, and only a single row of H is needed after ns weight reuses. For SGD (and hence for Tiki-Taka and TTv2), the activations computed in the forward pass are first stored in off-chip memory and then fetched from it to compute the error signals during the backpropagation. The activation stores and loads are also usually O(n·ws), and therefore the additional access to H in TTv2 similarly increases the required memory bandwidth by about O(n·ws/ns).

In summary, compared to SGD, TTv2 introduces extra digital costs that are only on the order of 1/ns, whereas it brings orders of magnitude relaxation to many stringent analog hardware specs. For instance, ns = 5 provided the best training results for the LSTM network, and for that network, the additional burden introduced to digital compute and memory bandwidth remains less than 20%. For the first convolutional layer of the MNIST problem, ns = 576 is used, making the additional cost negligible (Gokmen and Haensch, 2020). However, we note that neural networks come in many different flavors, beyond those studied in this manuscript, with different stress points on various hardware architectures. Our complexity arguments should only be used to compare the relative overhead of TTv2 to SGD, assuming a fixed analog crossbar-based architecture and particular neural network layers. A detailed power/performance analysis of TTv2 with an optimized architecture for a broad class of neural network models requires additional studies.

PART II: Model Extraction

Machine learning experts try various neural network architectures and hyper-parameter settings to obtain the best performing model during model development. Therefore, accelerating the DNN training process is extremely important. However, once the desired model is obtained, it is equally important to deploy the model in the field successfully. Even though training may use one set of hardware, numerous users likely run the deployed model on several hardware architectures, separate from the one the machine learning experts trained the model with. Therefore, to close the development and deployment lifecycle, the desired model must be extracted from the analog hardware for its deployment on another hardware.

In contrast to digital solutions, the weights of the model are not directly accessible on analog hardware. Analog arrays encode the model's weights, and the tile's noisy mat-vec limits access to these weight matrices. Therefore, the extraction of the model from analog hardware is a non-trivial task. Furthermore, the model extraction must produce a good representation of the trained model to be deployed without loss of accuracy on another analog or a completely different digital hardware for inference workloads.

In Part II, we first show how an accurate weight extraction can be performed from noisy analog hardware. Then we further generalize this method to obtain an accurate model average over the TTv2 iterates. Ref Izmailov et al. (2019a) showed that the Stochastic Weight Averaging (SWA) procedure, which performs a simple averaging of multiple points along the trajectory of SGD, leads to better generalization than conventional training. Our analog-hardware-friendly SWA on TTv2 iterates shows that these techniques inspired by the Bayesian treatment of neural networks can also be applied to analog training hardware successfully. We show that the model averaging further boosts the extracted model's generalization performance and provides a model that is even better than the trained model itself, enabling the deployment of the extracted model virtually on any other hardware.
Accurate Weight Extraction

Analog tiles perform mat-vecs on the stored matrices. Therefore, naively, one can perform a series of mat-vecs using one-hot encoded inputs to extract the stored values one column (or one row) at a time. However, this scheme results in a very crude estimation of the weights due to the mat-vec noise and limited ADC resolution. Instead, we perform a series of mat-vecs using random inputs and then use the conventional linear regression formula, Eq. 4, to estimate the weights.

Ĉ = [(XX^T)^−1 XY^T]^T   (4)

In Eq. 4, Ĉ is an estimate of the ground truth matrix C stored on the tile, X has the inputs used during weight extraction, and Y has the resulting outputs read from the tile. Both X and Y are written in matrix form, capturing all the mat-vecs performed on the tile.

Figure 6 shows the quality of different weight estimations for a simulated tile of size 512 × 512 with the same analog array assumptions described in Part I. When one-hot encoded input vectors are used only once, corresponding to 512 mat-vecs, the correlation of the extracted values to the ground truth is very poor due to analog mat-vec noise (σMV) and ADC quantization, as seen in Figure 6A. Repeating the same measurements 20 times, corresponding to a total of 10,240 mat-vecs, improves the quality of the estimate (Figure 6B). However, the best estimate is obtained when completely random inputs with uniform distribution are used, as illustrated in Figure 6C. We note that the total number of mat-vecs is the same for Figures 6B,C, and yet Figure 6C provides a much better estimate. This is because the completely random inputs have the highest entropy (information content), and therefore they provide the best estimate of the ground truth for the same number of mat-vecs.

Note that, in this linear regression formalism, the tile noise and quantization error correspond to aleatoric uncertainty and cannot be improved. However, the weight estimates are not limited by the aleatoric uncertainty; instead, the epistemic uncertainty limits these estimates. For the data shown in Figure 6C, the standard deviation in weight estimation (corresponding to the epistemic uncertainty) is 0.002, only 0.1% of the nominal weight range of [−1, 1] used for these experiments. The uncertainty in the weight estimates scales with 1/√(number of mat-vecs), and if needed, this uncertainty can be further reduced by performing more measurements.
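The extraction procedure can be reproduced with a few lines of linear algebra, as in the sketch below. The tile, its noise, and the ADC model are simplified stand-ins for the array assumptions of Part I (the noise and quantization values here are arbitrary), but the comparison mirrors Figure 6: for the same mat-vec budget, random inputs combined with the least-squares formula of Eq. 4 recover the stored weights far better than repeated one-hot reads.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_matvec = 512, 10_240
sigma_mv, adc_step = 1.0, 0.5           # assumed read noise and ADC quantization step

C_true = rng.uniform(-1, 1, (n, n))

def tile_matvec(X):
    """Noisy, quantized Y = C X for a batch of input columns X."""
    Y = C_true @ X + sigma_mv * rng.standard_normal((n, X.shape[1]))
    return adc_step * np.round(Y / adc_step)    # crude ADC model

# (a) one-hot inputs repeated 20 times: average the repeated read-outs per column
X_onehot = np.tile(np.eye(n), 20)
C_onehot = tile_matvec(X_onehot).reshape(n, 20, n).mean(axis=1)

# (b) uniform-random inputs with the same mat-vec budget, estimated via Eq. 4
X = rng.uniform(-1, 1, (n, n_matvec))
Y = tile_matvec(X)
C_hat = (np.linalg.inv(X @ X.T) @ X @ Y.T).T

for name, est in [("one-hot x20", C_onehot), ("random + Eq. 4", C_hat)]:
    print(name, "error std:", float((est - C_true).std()))
```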
FIGURE 6 | (A–C) Correlation between the ground truth weights and the extracted values using different input forms and number of mat-vecs for a simulated 512 × 512 tile. Red lines are guides to the eye showing perfect correlation.
Accurate Model Average

As shown in Ref Izmailov et al. (2019a), SWA performs a simple averaging of multiple points along the trajectory of SGD and leads to better generalization than conventional training. This SWA procedure approximates the Fast Geometric Ensemble (FGE) approach with a single model. Furthermore, Ref Yang et al. (2019) showed that SWA brings benefits to low-precision training. Here, we propose that weight averaging over TTv2 iterates would also bring similar gains and possibly overcome the noisy updates unique to the RPU devices. However, obtaining the weight averages from analog hardware may become prohibitively expensive. Naïvely, the weights can be first extracted from analog hardware after each iteration and then accumulated in the digital domain to compute averages. However, this requires thousands of mat-vecs per iteration and therefore is not feasible.

Instead, to estimate the weight averages, we perform a series of mat-vecs that are very sparse in time but performed while the training progresses, and then use the same linear regression formula to extract the weights. Since the mat-vecs are performed while the weights are still evolving, the extracted values closely approximate the weight averages for that training period. For instance, during the last 10 epochs of the TTv2 iterates, we performed 100 K mat-vecs with uniform-random inputs and showed that this is sufficient to estimate the actual weight averages with less than 0.1% uncertainty.

We note that about 60 M mat-vecs on C and 30 M updates on A are performed during 10 epochs of training. Therefore, the additional 100 K mat-vecs on C needed for weight averaging increase the compute on the analog tiles by only 0.1%. Furthermore, the input and output vectors (x, y) for each mat-vec can be processed on the fly by accumulating the results of xx^T and xy^T on two separate matrices in the digital domain: Mxx ← Mxx + xx^T and Mxy ← Mxy + xy^T. Then, at the end of the training, one matrix inversion and a final matrix-matrix multiplication need to be performed to complete all the steps needed to estimate the weight averages: Cavg = ((Mxx)^−1 Mxy)^T.

In practical applications, a separate conventional digital processor (like a CPU) can perform the computations needed for the weight averages by only receiving the results of the mat-vecs from the analog accelerator. Note that the CPU can generate the same input vectors by using the same random seed. Therefore, Mxx and its inverse can be computed and stored well ahead of time, even before training starts. Furthermore, the same input vectors and a common (Mxx)^−1 can extract the weight averages from multiple analog tiles. After all these optimizations, even a conventional digital processor can sustain the computation needed for Mxy from multiple tiles and provide the weight averages at the end of training.
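The on-the-fly bookkeeping described above boils down to two running sums and a single solve at the end, as the sketch below illustrates on a synthetic, slowly drifting weight matrix (tile size, probe count, drift, and noise are all assumptions made for the example).

```python
import numpy as np

n, n_probe = 256, 5000
rng_inputs = np.random.default_rng(123)   # the host CPU can replay this seed
rng_sim = np.random.default_rng(7)

C_tile = np.zeros((n, n))                 # stand-in for the analog weights
C_mean = np.zeros((n, n))                 # running time average (for checking only)
M_xx = np.zeros((n, n))
M_xy = np.zeros((n, n))

for k in range(n_probe):
    C_tile += 1e-4 * rng_sim.standard_normal((n, n))     # weights keep evolving
    C_mean += (C_tile - C_mean) / (k + 1)
    x = rng_inputs.uniform(-1, 1, n)                      # sparse-in-time probe input
    y = C_tile @ x + 0.05 * rng_sim.standard_normal(n)    # noisy mat-vec from the tile
    M_xx += np.outer(x, x)                                # M_xx <- M_xx + x x^T
    M_xy += np.outer(x, y)                                # M_xy <- M_xy + x y^T

# one matrix inversion and one matrix-matrix multiply at the end of training
C_avg = (np.linalg.inv(M_xx) @ M_xy).T
print("mean |C_avg - time average|:", float(np.abs(C_avg - C_mean).mean()))
```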
Inference Results

To test the validity of the proposed weight extraction and averaging techniques, we study the same model trained on extremely noisy analog hardware using TTv2 with the hysteretic threshold. We refer to this model as Model-I. As shown in Figure 5E, the test errors of Model-I at the end of the 50th and 200th epochs are 1.633 and 1.570, respectively. These test errors assume Model-I runs inferences on the same analog hardware it was trained on, and they form our baseline.

In the first experiment, we apply our model extraction technique and obtain the weights using only 10 K mat-vecs with random inputs. We refer to this extracted model as Model-Ix, and it is an estimate of Model-I. We evaluate the test error of Model-Ix when it runs either on another analog hardware (with the same analog array properties) or on digital hardware. As summarized in Table 1, Model-Ix's test error remains unchanged on the new analog hardware compared to Model-I, showing our model extraction technique's success. Interestingly, the inference results of Model-Ix are better on the digital hardware, and the test errors drop to 1.583 and 1.524, respectively, for the 50th and 200th epochs. These improvements are due to the absence of the mat-vec noise introduced by the forward propagation on analog hardware. However, these results also highlight that the analog training yields a better model than the test error on the same analog hardware indicates. Therefore, such benefits ease analog hardware's adoption for training-only purposes, and the improved test results on digital hardware are the relevant metrics for such a use case.

In the following experiment, we implement our model averaging technique using 100 K mat-vecs with random inputs applied between epochs 40–50 or 180–200. We refer to the extracted model average as Model-Iavg, and the test error for Model-Iavg is also evaluated on analog or digital hardware. In all cases, as illustrated in Table 1, Model-Iavg gives non-trivial improvements compared to Model-Ix (and Model-I). These improvements in the averaged models' generalization performance show the success of our model averaging technique. We emphasize that the model training is performed on extremely noisy analog hardware using TTv2. Nevertheless, the test error achieved by Model-Iavg on digital hardware is 1.454, just shy of the FP model's performance at about 1.325.

Finally, to further illustrate the success of the proposed model extraction and averaging techniques, we performed simulations for another two models, Model-II and Model-III, which are also
summarized in Table 1. Like Model-I, these models are also trained on noisy analog hardware but with slightly relaxed array assumptions. The only two differences compared to Model-I are that 1) Model-II and Model-III both used analog arrays with the additive cycle-to-cycle update noise at σcycle = 0.3, and 2) Model-II and Model-III respectively had 60 and 120 states on the RPU devices. For these slightly relaxed but still significantly noisy analog hardware settings, both Model-II and Model-III provide test results on the digital hardware that are virtually indistinguishable from the FP model when the model averages between epochs 180–200 are used.

We note that the inference simulations performed on analog hardware did not include any weight programming errors that may otherwise exist in real hardware. Depending on their strength, these weight programming errors cause an accuracy drop on analog hardware used solely for inference purposes. Additionally, after the initial programming, the accuracy may further decline over time due to device instability, such as conductance drift (Mackin et al., 2020; Joshi et al., 2020). Therefore, any analog hardware targeting inference workloads must address these non-idealities. However, we emphasize that these problems are unique to inference workloads. Instead, if analog hardware targets training workloads only, these problems become obsolete.

Furthermore, the unique challenges of the analog training hardware, namely the limited number of states on RPU devices and the update noise, are successfully handled by our proposed TTv2 training algorithm and the model averaging technique. As illustrated above, even very noisy analog hardware can deliver models on par in accuracy with FP models. In addition, after the training process is performed on analog hardware using TTv2, the extracted model average can be deployed on various digital hardware and perform inference without any accuracy loss. Therefore, these results provide a clear path for analog hardware to be employed to accelerate DNN training workloads.

DISCUSSION AND FUTURE DIRECTIONS

DNN training using SGD is simply an optimization algorithm that provides a point estimate of the DNN parameters at the end of the training. In this frequentist view, a hypothesis is tested without assigning any probability distribution to the DNN parameters, and the representation of uncertainty is lacking. More recently, however, the Bayesian treatment of DNNs has gained more traction with new approximate Bayesian approaches (Wilson, 2020). Bayesian approaches treat the DNN parameters as random variables with probabilities. We believe many exciting directions for future research may connect these approximate Bayesian approaches and neural networks running on noisy analog hardware.

For instance, Ref Maddox et al. (2019) showed that a simple baseline for Bayesian uncertainty could be formed by determining the weight uncertainties from the SGD iterates, referred to as SWA-Gaussian. It is empirically shown that SWA-Gaussian approximates the shape of the true posterior distribution of the weights, described by the stationary distribution of SGD iterates. We can intuitively generalize these results to the TTv2 algorithm running on analog hardware. For instance, the proposed TTv2 algorithm updates a tiny fraction of the neural network weights when enough evidence is accumulated by A and H's gradient processing steps. Nevertheless, the updates on the weights are still noisy due to stochasticity in the analog hardware. Therefore, TTv2 iterates resemble the Gibbs sampling algorithm used to approximate a posterior multivariate probability distribution governed by the loss surface of the DNN. Assuming this intuition is correct, analyzing the uncertainty in the weights over TTv2 iterates may provide a simple Bayesian treatment of a DNN, similar to SWA-Gaussian.

To test the feasibility of the above arguments, we performed the following experiments, which are motivated by the results of SWA-Gaussian (Maddox et al., 2019) and Bayes-by-Backprop (Blundell et al., 2015). First, we extract the mean (μi) and the standard deviation (σi) of each weight from the TTv2 iterates and define a signal-to-noise ratio as μi/σi. Then we remove the weights whose signal-to-noise ratio is below a certain value and compare the inference performance of this carefully pruned network to the unpruned one. We also look at the performance degradation of a randomly pruned network with the same amount of weight pruning. Table 2 summarizes the results of these
experiments performed for Model-III from 180 to 200 epochs.

TABLE 2 | Carefully pruned vs. randomly pruned network performance for different signal-to-noise thresholds used for pruning and weight proportions removed (%). Random pruning experiments are performed 10 times; the table reports the mean and standard deviation of these 10 experiments for the randomly pruned networks. An untrained network gives ∼4.46 test error, corresponding to a random guess.

As illustrated in Table 2, the carefully pruned network's performance (1.331) is almost identical to the unpruned one (1.326) when μi/σi < 1, corresponding to 16.7% pruning. However, the same amount of pruning causes significant performance degradation for a randomly pruned network (∼3.42). When the signal-to-noise threshold is raised to 3, corresponding to 40.8% pruning, the carefully pruned network still performs reasonably well (1.466), whereas at this level of pruning a randomly pruned network is not any better than an untrained network producing random predictions.

In the second set of experiments, as summarized in Table 3, we use the extracted means (μi) and standard deviations (σi) and disturb each weight randomly, proportionally to its standard deviation: wi = μi + ξσi, where ξ is sampled from a unit Gaussian for each weight. Then, we compare the inference performance of this carefully disturbed network to a randomly disturbed network with the same amount of total weight disturbance. Although the carefully disturbed network performs reasonably well at 1.493, the randomly disturbed networks' performance significantly degrades to about 3.54.

TABLE 3 | Carefully disturbed vs. randomly disturbed network performance. Random disturb experiments are performed 10 times; the table reports the mean and standard deviation of these 10 experiments.
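Both experiments reduce to simple element-wise rules once μi and σi are available; the sketch below spells them out on synthetic statistics. The distributions, the threshold, and the way the "same total disturbance" is matched (by shuffling the σi) are assumptions made for illustration; reproducing the test errors of Tables 2, 3 would require plugging these weights back into the actual LSTM.

```python
import numpy as np

rng = np.random.default_rng(5)
n_weights = 10_000

# stand-ins for the per-weight means and standard deviations from TTv2 iterates
mu = rng.normal(0.0, 0.3, n_weights)
sigma = np.abs(rng.normal(0.05, 0.02, n_weights)) + 1e-3

snr = np.abs(mu) / sigma                       # per-weight signal-to-noise ratio

# careful pruning: remove weights whose signal-to-noise ratio falls below a threshold
threshold = 1.0
w_careful = np.where(snr < threshold, 0.0, mu)
k = int((snr < threshold).sum())
print("pruned fraction:", k / n_weights)

# random pruning of the same number of weights, for comparison
w_random = mu.copy()
w_random[rng.choice(n_weights, size=k, replace=False)] = 0.0

# careful disturbance: w_i = mu_i + xi * sigma_i, one unit-Gaussian sample per weight
w_disturbed = mu + sigma * rng.standard_normal(n_weights)
# random disturbance with the same total amount of perturbation (sigmas shuffled)
w_rand_dist = mu + rng.permutation(sigma) * rng.standard_normal(n_weights)
```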
These experiments empirically suggest that the weight uncertainty of TTv2 iterates on analog hardware provides additional valuable information about the posterior probability distribution of the weights governed by the loss surface of the DNN. The results illustrated in Tables 2, 3 do not address how the weight uncertainty can be extracted from analog hardware in practical settings; however, suppose this information can be extracted. In that case, the weight uncertainty can be used to sparsify the DNN during the model deployment on digital hardware (Blundell et al., 2015). Alternatively, the weight uncertainties can be leveraged to devise better programming routines while transferring the model to another noisy analog hardware. In addition, a low-dimensional subspace can be constructed over TTv2 iterates so that the model can be deployed as a Bayesian neural network, similar to the results presented in Ref Izmailov et al. (2019b). The Bayesian model averaging performed even in low-dimensional subspaces produces accurate predictions and well-calibrated predictive uncertainty (Izmailov et al., 2019b). We believe that noisy analog hardware with modified learning algorithms can also accelerate Bayesian approaches while simultaneously providing many known benefits, such as improved generalization and uncertainty calibration. However, these ideas require further investigation, and new techniques that can also extract the weight uncertainty from analog hardware are needed. Furthermore, extending this work to larger and more extensive networks is a general task for the feasibility of analog crossbar arrays, not restricted to the work presented here.

SUMMARY

In summary, we presented a new DNN training algorithm, TTv2, that provides successful training on extremely noisy analog hardware composed of resistive crossbar arrays. Compared to previous solutions, TTv2 addresses all sorts of hardware non-idealities coming from resistive devices and peripheral circuits and provides orders of magnitude relaxation to many hardware specs. Device arrays with non-symmetric and noisy conductance modulation characteristics and a limited number of states are enough for TTv2 to train neural networks close to their ideal accuracy. In addition, the model averaging technique applied over TTv2 iterates provides further enhancements during the model extraction. In short, we describe an end-to-end training algorithm and model extraction technique for extremely noisy crossbar-based analog hardware that matches the performance of full-precision SGD training. Our techniques can be immediately realized and applied to many readily available device technologies that can be utilized for analog deep learning accelerators.

DATA AVAILABILITY STATEMENT

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS

TG conceived the original idea, developed the methodology, wrote the simulation code, analyzed and interpreted the results, and drafted the manuscript.

ACKNOWLEDGMENTS

The author thanks Wilfried Haensch for illuminating discussions and Paul Solomon for careful reading of the manuscript.
REFERENCES

Agarwal, S., Plimpton, S. J., Hughart, D. R., Hsia, A. H., Richter, I., Cox, J. A., et al. (2016). Resistive Memory Device Requirements for a Neural Network Accelerator. Vancouver, BC, Canada: IJCNN. doi:10.1109/IJCNN.2016.7727298

Agarwal, S., Gedrim, R. B. J., Hsia, A. H., Hughart, D. R., Fuller, E. J., Talin, A. A., James, C. D., Plimpton, S. J., and Marinella, M. J. (2017). "Achieving Ideal Accuracies in Analog Neuromorphic Computing Using Periodic Carry," in Symposium on VLSI Technology, Kyoto, Japan. doi:10.23919/vlsit.2017.7998164

Ambrogio, S., Narayanan, P., Tsai, H., Shelby, R. M., Boybat, I., di Nolfo, C., et al. (2018). Equivalent-accuracy Accelerated Neural-Network Training Using Analogue Memory. Nature 558, 60–67. doi:10.1038/s41586-018-0180-5

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). "Weight Uncertainty in Neural Networks," in Proceedings of the 32nd International Conference on Machine Learning, PMLR 37, 1613–1622.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). "Language Models Are Few-Shot Learners," arXiv:2005.14165 [cs.CL].

Burr, G. W., Narayanan, P., Shelby, R. M., Sidler, S., Boybat, I., di Nolfo, C., and Leblebici, Y. (2015). "Large-scale Neural Networks Implemented with Non-volatile Memory as the Synaptic Weight Element: Comparative Performance Analysis (Accuracy, Speed, and Power)," in IEDM (International Electron Devices Meeting). doi:10.1109/iedm.2015.7409625

Burr, G. W., Shelby, R. M., Sebastian, A., Kim, S., Kim, S., Sidler, S., et al. (2017). Neuromorphic Computing Using Non-volatile Memory. Adv. Phys. X 2, 89–124. doi:10.1080/23746149.2016.1259585

Cloud TPU (2007). Available at: https://cloud.google.com/tpu/docs/bfloat16.

Fuller, E. J., Keene, S. T., Melianas, A., Wang, Z., Agarwal, S., Li, Y., et al. (2019). Parallel Programming of an Ionic Floating-Gate Memory Array for Scalable Neuromorphic Computing. Science 364 (6440), 570–574. doi:10.1126/science.aaw5581

Gokmen, T., and Haensch, W. (2020). Algorithm for Training Neural Networks on Resistive Device Arrays. Front. Neurosci. 14, 103. doi:10.3389/fnins.2020.00103

Gokmen, T., Onen, M., and Haensch, W. (2017). Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices. Front. Neurosci. 11, 538. doi:10.3389/fnins.2017.00538

Gokmen, T., Rasch, M. J., and Haensch, W. (2018). Training LSTM Networks with Resistive Cross-Point Devices. Front. Neurosci. 12, 745. doi:10.3389/fnins.2018.00745

Gokmen, T., and Vlasov, Y. (2016). Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices: Design Considerations. Front. Neurosci. 10, 333. doi:10.3389/fnins.2016.00333

Graphcore (2021). Available at: https://www.graphcore.ai/.

Grollier, J., Querlioz, D., Camsari, K. Y., Everschor-Sitte, K., Fukami, S., and Stiles, M. D. (2020). Neuromorphic Spintronics. Nat. Electron. 3, 360–370. doi:10.1038/s41928-019-0360-9

Haensch, W., Gokmen, T., and Puri, R. (2019). The Next Generation of Deep Learning Hardware: Analog Computing. Proc. IEEE 107, 108–122. doi:10.1109/jproc.2018.2871057

He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in IEEE International Conference on Computer Vision (ICCV). doi:10.1109/iccv.2015.123

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2019a). "Averaging Weights Leads to Wider Optima and Better Generalization," arXiv:1803.05407 [cs.LG].

Izmailov, P., Maddox, W., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. (2019b). "Subspace Inference for Bayesian Deep Learning," in Uncertainty in Artificial Intelligence (UAI) 115, 1169–1179.

Joshi, V., Le Gallo, M., Haefeli, S., Boybat, I., Nandakumar, S. R., Piveteau, C., et al. (2020). Accurate Deep Neural Network Inference Using Computational Phase-Change Memory. Nat. Commun. 11, 2473. doi:10.1038/s41467-020-16108-9

Kim, H., Rasch, M., Gokmen, T., Ando, T., Miyazoe, H., Kim, J.-J., et al. (2019). "Zero-shifting Technique for Deep Neural Network Training on Resistive Cross-point Arrays," arXiv:1907.10228 [cs.ET].

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep Learning. Nature 521, 436–444. doi:10.1038/nature14539

Mackin, C., Narayanan, P., Ambrogio, S., Tsai, H., Spoon, K., Fasoli, A., Chen, A., Friz, A., Shelby, R. M., and Burr, G. W. (2020). "Neuromorphic Computing with Phase Change, Device Reliability, and Variability Challenges," in IEEE International Reliability Physics Symposium (IRPS), Dallas, TX, USA. doi:10.1109/irps45951.2020.9128315

Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). "A Simple Baseline for Bayesian Uncertainty in Deep Learning," in Advances in Neural Information Processing Systems (Vancouver, BC, Canada: NeurIPS) 32, 13153–13164.

Miyashita, D., Lee, E. H., and Murmann, B. (2016). "Convolutional Neural Networks Using Logarithmic Data Representation," arXiv:1603.01025 [cs.NE].

Nandakumar, S. R., Le Gallo, M., Piveteau, C., Joshi, V., Mariani, G., Boybat, I., et al. (2020). Mixed-Precision Deep Learning Based on Computational Memory. Front. Neurosci. 14, 406. doi:10.3389/fnins.2020.00406

Nvidia (2021). Available at: https://www.nvidia.com/en-us/data-center/a100/.

Onen, M., Gokmen, T., Todorov, T. K., Nowicki, T., Alamo, J. A. D., Rozen, J., et al. (2021). Neural Network Training with Asymmetric Crosspoint Elements. Submitted for publication.

Rasch, M. J., Gokmen, T., and Haensch, W. (2020). Training Large-Scale Artificial Neural Networks on Simulated Resistive Crossbar Arrays. IEEE Des. Test 37 (2), 19–29. doi:10.1109/mdat.2019.2952341

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature 323, 533–536. doi:10.1038/323533a0

Steinbuch, K. (1961). Die Lernmatrix. Kybernetik 1, 36–45.

Strubell, E., Ganesh, A., and McCallum, A. (2019). "Energy and Policy Considerations for Deep Learning in NLP," in ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 3645–3650.

Sun, X., Choi, J., Chen, C.-Y., Wang, N., Venkataramani, S., Srinivasan, V., et al. (2019). Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks. Adv. Neural Inf. Process. Syst. 32, 4901–4910.

Sun, X., Wang, N., Chen, C.-Y., Ni, J.-M., Agrawal, A., Cui, X., et al. (2020). Ultra-Low Precision 4-bit Training of Deep Neural Networks. Adv. Neural Inf. Process. Syst. 33, 1796–1807.

Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 105, 2295–2329. doi:10.1109/jproc.2017.2761740

Wilson, A. G. (2020). Bayesian Deep Learning and a Probabilistic Perspective of Model Construction. International Conference on Machine Learning Tutorial.

Woo, J., and Yu, S. (2018). Resistive Memory-Based Analog Synapse: The Pursuit for Linear and Symmetric Weight Update. IEEE Nanotechnology Mag. 12, 36–44. doi:10.1109/mnano.2018.2844902

Yang, G., Zhang, T., Kirichenko, P., Bai, J., Wilson, A. G., and Sa, C. D. (2019). "SWALP: Stochastic Weight Averaging in Low-Precision Training," arXiv:1904.11943 [cs.LG].

Yu, S., Chen, P., Cao, Y., Xia, L., Wang, Y., and Wu, H. (2015). "Scaling-up Resistive Synaptic Arrays for Neuro-Inspired Architecture: Challenges and Prospect," in International Electron Devices Meeting (IEDM), Washington, DC, USA (IEEE). doi:10.1109/iedm.2015.7409718

Yu, S. (2018). Neuro-inspired Computing with Emerging Nonvolatile Memorys. Proc. IEEE 106, 260–285. doi:10.1109/jproc.2018.2790840

Conflict of Interest: TG was employed by the company IBM. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2021 Gokmen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.