Analog architectures for neural network

acceleration based on non-volatile memory


Cite as: Appl. Phys. Rev. 7, 031301 (2020); https://doi.org/10.1063/1.5143815
Submitted: 06 January 2020 • Accepted: 10 June 2020 • Published Online: 09 July 2020

T. Patrick Xiao, Christopher H. Bennett, Ben Feinberg, et al.

COLLECTIONS

Paper published as part of the special topic on Brain Inspired Electronics

This paper was selected as Featured


Analog architectures for neural network acceleration based on non-volatile memory

Cite as: Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815
Submitted: 6 January 2020 • Accepted: 10 June 2020 • Published Online: 9 July 2020

T. Patrick Xiao,a) Christopher H. Bennett, Ben Feinberg, Sapan Agarwal, and Matthew J. Marinellab)

AFFILIATIONS
Sandia National Laboratories, Albuquerque, New Mexico 87185-1084, USA

Note: This paper is part of the special collection on Brain Inspired Electronics.
a)Author to whom correspondence should be addressed: [email protected]
b)Electronic mail: [email protected]

ABSTRACT
Analog hardware accelerators, which perform computation within a dense memory array, have the potential to overcome the major bottlenecks faced by digital hardware for data-heavy workloads such as deep learning. Exploiting the intrinsic computational advantages of memory arrays, however, has proven to be challenging principally due to the overhead imposed by the peripheral circuitry and due to the non-ideal properties of memory devices that play the role of the synapse. We review the existing implementations of these accelerators for deep supervised learning, organizing our discussion around the different levels of the accelerator design hierarchy, with an emphasis on circuits and architecture. We explore and consolidate the various approaches that have been proposed to address the critical challenges faced by analog accelerators, for both neural network inference and training, and highlight the key design trade-offs underlying these techniques.
© 2020 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). https://doi.org/10.1063/1.5143815

TABLE OF CONTENTS
I. INTRODUCTION . . . 2
II. COMPUTATIONAL PRIMITIVES IN DEEP LEARNING . . . 2
   A. Inference and training of deep neural networks . . . 2
   B. Recurrent and convolutional neural networks . . . 3
   C. Processing large neural networks . . . 4
III. DIGITAL NEUROMORPHIC ARCHITECTURES . . . 5
IV. ANALOG IN-MEMORY ACCELERATION OF NEURAL NETWORKS . . . 6
   A. Candidate memory devices . . . 7
      1. Device requirements for inference and training . . . 7
      2. Emerging non-volatile memories . . . 8
      3. Floating-gate and charge-trap memory . . . 9
      4. VMM with volatile capacitive memories . . . 10
   B. Access devices . . . 10
   C. Limitations on crossbar size . . . 11
V. PERIPHERAL CIRCUITS IN ANALOG ACCELERATORS . . . 11
   A. Analog vs digital routing . . . 12
   B. Driving the crossbar inputs . . . 12
      1. Analog voltage levels . . . 12
      2. Analog temporal encoding . . . 13
      3. Digital temporal encoding . . . 13
      4. Input bit slicing . . . 13
   C. Analog-to-digital converters . . . 14
   D. The neuron function . . . 15
VI. ARCHITECTURES FOR INFERENCE . . . 16
   A. Synaptic bit slicing . . . 16
   B. Reducing ADC overhead . . . 17
   C. Signed computation . . . 18
   D. Hierarchical organization . . . 18
   E. Convolutional neural network inference acceleration . . . 19
   F. Flexibility and reconfigurability . . . 20
   G. Performance comparison of inference accelerators . . . 20
VII. ARCHITECTURES FOR TRAINING . . . 22
   A. Supporting backpropagation . . . 22
   B. Parallel outer product update . . . 23
   C. Training with batch size > 1 . . . 23


   D. Training convolutional neural networks . . . 24
   E. Compensation for device imprecision and asymmetry . . . 24
VIII. MITIGATING ARRAY- AND DEVICE-LEVEL NON-IDEALITIES . . . 25
   A. Parasitic resistance . . . 25
   B. Handling sparse neural networks . . . 26
   C. Hardware robustness to noise, drift, and device failures . . . 26
   D. Device-aware neural network training . . . 27
IX. CONCLUSION . . . 27

I. INTRODUCTION

For decades, processor performance has advanced at an inexorable pace by riding on continued increases in transistor density, enabled by Dennard scaling, and, more recently, by running many processor cores in parallel. In the last decade, this performance has come to a plateau, as power constraints have slowed the scaling of transistors and as Amdahl's law has diminished the returns from multi-core computation using general-purpose processors. Significant future improvements in computation—measured as the raw performance (operations per second), performance per area, and performance per Watt—will increasingly come from innovative domain-specific architectures that are specialized toward a narrower set of tasks.1

At the same time, the fields of neuromorphic computing and machine learning—which have applications in image and speech recognition, natural language processing, predictive analytics, and many other domains—have recently gained considerable scientific and commercial interest. Modern neural networks have grown so quickly in scale that training and deploying them now often require computational resources that lie outside the reach of most researchers and individual edge devices.2 Fortuitously, neural networks are well suited to acceleration by domain-specific hardware, as most neuromorphic algorithms require only a small set of critical computational kernels. There is already a proliferation of new architectures specifically aimed at accelerating neural networks, many of which are custom digital chips.3,4 For modern machine learning workloads involving large datasets, however, these systems are critically constrained by the need to transfer data between memory and the processor.5

The so-called memory wall, also called the von Neumann bottleneck, presents an opportunity for neuromorphic accelerators that can perform computations directly inside the memory array where the network's parameters are stored. Analog processing inside such an array can also inherently parallelize the matrix algebra computational primitives that underlie many machine learning algorithms. These intrinsic advantages can potentially translate into dramatic improvements in both the computational capacity and energy efficiency of neuromorphic hardware.

The basic processing element in such a system is an array (or crossbar) of memory devices. Many of the candidate devices for this application are emerging memory technologies, though it is also possible—and often advantageous—to leverage mature technologies such as flash memory. In this review, we discuss the practical considerations, unique challenges, and the diversity of possible implementations of a neuromorphic accelerator that is constructed from these memory arrays. We will focus primarily on considerations at the level of circuits and system architectures, as several reviews have recently covered these accelerators from a more device-oriented point of view.6–9 We also narrow our attention to accelerators for deep supervised learning, which is the most well-researched and deployable set of neural architectures, rather than systems specialized for unsupervised learning algorithms or spiking neural networks. Resistive crossbars have notably also been explored for accelerating several related computational kernels, such as solving linear systems10,11 and combinatorial optimization,12,13 which will not be the focus of this review.

Somewhat differently from recent surveys14–16 of neural network accelerators based on emerging devices, we organize this review around the basic components and ideas that make crossbar-based architectures work. These ideas are partially but not entirely agnostic of the specific choice of memory device. We begin with an overview of deep learning and the needs of a deep learning hardware accelerator (Sec. II) and then briefly discuss digital architectures specialized for this application (Sec. III). In Sec. IV, we introduce the analog, crossbar-based neuromorphic accelerator and briefly survey the field of candidate memory devices. Section V discusses the peripheral circuits, analog and digital, which support crossbar computations—if not carefully designed, these circuits can greatly offset the basic energy, latency, and area advantages of the approach. In Secs. VI and VII, we discuss the architecture of analog accelerators for neural network inference and training, respectively, with an emphasis on system-level design trade-offs. Finally, in Sec. VIII, we survey some known approaches to combat device- and array-level non-ideal effects using architectural and algorithmic techniques.

II. COMPUTATIONAL PRIMITIVES IN DEEP LEARNING

In this section, we provide a brief introduction to artificial neural networks trained by the widely used backpropagation algorithm for supervised learning. We focus our attention on the main linear-algebra computational primitives that underlie these algorithms, the acceleration of which has motivated much of the recent work on analog neuromorphic hardware. For a more complete pedagogical introduction to deep learning and other machine learning algorithms, we refer the reader to Refs. 17 and 18.

A. Inference and training of deep neural networks

The topology of a typical artificial neural network with a single hidden layer is shown in Fig. 1(a). Input data to be processed (e.g., pixels of an image or an audio sequence, cast as a 1D vector in this case) are forward-propagated through the network from left to right by a sequence of linear and nonlinear operations. The input neuron activations x^(l) to layer l are converted to the next layer's activations x^(l+1) by the transformation

x^(l+1) = f(W^(l) x^(l)),   (1)

where W^(l) is an N_l × N_{l+1} matrix of weights (or synapses) connecting layer l with N_l neurons to layer l+1 with N_{l+1} neurons. f is a nonlinear activation function that is applied element-wise to its argument, which is the product of a vector–matrix multiplication (VMM). A bias vector b^(l) is also added to the argument of f; for brevity, we have chosen not to include it explicitly here, as it can be absorbed into W^(l).
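As an illustration of the forward-propagation primitive in Eq. (1), the short NumPy sketch below chains a few fully connected layers. The layer sizes, the weight initialization, and the choice of ReLU for f are arbitrary assumptions made for this example, not values taken from the text.

```python
import numpy as np

def relu(z):
    # Rectified linear unit, one common choice for the activation f
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Illustrative layer sizes: 784 inputs, 100 hidden neurons, 10 outputs
sizes = [784, 100, 10]
# One N_l x N_{l+1} weight matrix per layer; biases are omitted
# (they can be absorbed into W, as noted in the text)
W = [rng.normal(0.0, 0.1, (sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]

x = rng.random(sizes[0])     # input activations x^(0)
for Wl in W:
    x = relu(x @ Wl)         # Eq. (1): a VMM followed by the element-wise nonlinearity
print(x.shape)               # (10,) final-layer activations
```

Stacking several input examples as rows of a matrix turns each step of the loop into the matrix–matrix multiplication mentioned at the end of Sec. II A for batched inference.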


The layer-to-layer transformation in Eq. (1) maps one representation of the input, encoded in the neurons x, to another. For example, the pixel-wise representation of an image may be converted to a hidden-layer representation in terms of a more abstract set of features; these features are propagated forward, being mapped to higher-level features, until one reaches the final-layer neurons, whose values represent the desired output from the neural network (e.g., classification labels). The purpose of having one or more hidden layers in a deep neural network is to convert the input into representations that are more amenable to making a correct prediction.

FIG. 1. A multi-layer perceptron neural network, showing a single hidden layer: (a) forward propagation through the network and (b) backpropagation of error through the network.

The nonlinear function f is essential in ensuring that the multiple layers in the network cannot be trivially collapsed into an equivalent single-layer linear network. For tasks such as computer vision, a common choice for f is the rectified linear unit (ReLU): f(x) = max(0, x). Other choices include the sigmoid function σ(x) and the hyperbolic tangent tanh(x), which are more commonly used in manipulating sequential data, e.g., in speech recognition. A different nonlinear function, such as a softmax, is usually applied at the final layer to normalize the output of the network. The propagation of input data through the network to produce an output prediction is called the inference step. In some implementations, a batch of input examples are collected into the activations x so that inference can be executed as a sequence of matrix–matrix multiplications.

In supervised learning, the network is trained to make accurate predictions by iteratively updating the weight matrices W (and the biases b) so that its outputs approach the provided correct outputs for a selection of input examples called the training set. Backpropagation of error, shown in Fig. 1(b), is the most widely used algorithm for training. For a given example, an inference step is first performed, and an error δ—typically a mean-squared or cross-entropy loss function—is calculated on the network's prediction based on the known correct output. By differentiating Eq. (1), the error at the output layer can be propagated backward through the network to find the error values at a layer l,

δ^(l) = (W^(l))^T δ^(l+1) ⊙ f′(W^(l−1) x^(l−1)),   (2)

where f′ is the derivative of f and ⊙ denotes an element-wise product. Note that this expression involves the multiplication of the transpose of the weight matrix with an error vector: to differentiate it from the earlier VMM, we will refer to this operation as a matrix–vector multiplication (MVM).

From the error vectors, the derivative with respect to the prediction error δ can be found for all the weights, and these derivatives are then used to update the weights according to an optimization algorithm, which is typically a form of gradient descent. The update to be applied to a weight matrix is given by

ΔW^(l) = η δ^(l) (x^(l))^T,   (3)

where η is a learning rate. Thus, the optimal update to the weights following a single example is an outer product of two vectors.

In stochastic gradient descent (SGD), the weight update in Eq. (3) is carried out after every training example. To speed up training and improve convergence, one can also perform forward propagation using the same weights over the full training set and then apply a sum of outer products as given by Eq. (3)—this is called batch gradient descent, or minibatch gradient descent if only a subset of the training set is used before each update. Hyperparameters, such as the number and sizes of the hidden layers and the learning rate (η), are typically fixed during training but may need to be tuned externally as part of the neural network design. The ultimate goal in both training and hyperparameter tuning is not necessarily to minimize the error on the training set but rather to generalize well to a test set, which contains examples that the network has not seen. Although SGD and minibatch gradient descent tend to have noisier convergence during training than batch gradient descent and other deterministic algorithms that utilize the full gradient,19 they have been shown to yield good generalization performance on large-scale learning problems by virtue of their stochastic nature, which has a regularization effect for some datasets.17,20
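To make Eqs. (2) and (3) concrete, the sketch below performs one SGD step for a toy two-layer network in NumPy. The layer sizes, the mean-squared loss, and the sign convention (the update is subtracted because δ is computed here as the gradient of the loss) are illustrative assumptions rather than choices prescribed by the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 784, 100, 10
W0 = rng.normal(0.0, 0.1, (n_in, n_hid))    # N_l x N_{l+1} layout, as in the text
W1 = rng.normal(0.0, 0.1, (n_hid, n_out))
eta = 0.01                                   # learning rate, eta in Eq. (3)

x0 = rng.random(n_in)                        # one training example
target = np.zeros(n_out); target[3] = 1.0    # arbitrary one-hot label

# Forward pass (inference step), keeping pre-activations for the derivatives
z1 = x0 @ W0; x1 = relu(z1)
z2 = x1 @ W1; x2 = relu(z2)

# Output error for a mean-squared loss: d2 is dLoss/dz2
d2 = (x2 - target) * relu_prime(z2)
# Eq. (2): propagate the error backward through the same weight array,
# multiplying from the other side (the MVM), with the element-wise f' factor
d1 = (W1 @ d2) * relu_prime(z1)

# Eq. (3): each weight update is an outer product of an error vector and
# an activation vector (a descent step for the loss convention used here)
W1 -= eta * np.outer(x1, d2)
W0 -= eta * np.outer(x0, d1)
```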


B. Recurrent and convolutional neural networks

We have so far considered a multi-layer perceptron (MLP)—a deep neural network consisting only of fully connected (FC) layers. Recurrent neural networks are usually employed for speech and natural language processing and can also be used for vision tasks. These networks are similar in structure to an MLP, with the critical difference that the weights are reused across inputs from multiple time steps. The widely used long short-term memory (LSTM) network, for example, processes an input vector together with a state vector for the current time step using a memory cell that combines multiple types of activation functions f.21 For computer vision tasks, convolutional neural networks (CNNs) are the preferred choice. In convolutional layers, only nearby neurons are connected by nonzero weights, leading to a structure that is more effective in capturing spatially local patterns in visual data. MLPs, LSTMs, and CNNs represented 95% of the inference workload in Google datacenters in 2016.22

FIG. 2. A convolutional layer. The colored pixels are processed in one sliding window step over the x-y plane.

Figure 2 shows the structure of a convolutional layer: the input and output feature maps are three-dimensional—x, y, and an additional channel dimension—and the weights are applied in the manner of a sliding window across the input much like a spatial filter in standard image processing. Forward propagation through a convolutional layer, analogous to Eq. (1), is given by

O_{x,y,m} = f( b_m + Σ_{k=0}^{N_ic−1} Σ_{j=0}^{K_y−1} Σ_{i=0}^{K_x−1} I_{Sx+i, Sy+j, k} · W_{i,j,k,m} ),   (4)

where I and O are the input and output feature maps and have N_ic and N_oc channels, respectively. W is the weight matrix that is represented here with four dimensions: the filter plane has dimensions K_x × K_y and stride S, and an independent 2D filter exists for every combination of input and output channels. The input is often zero-padded so that the output has the same size in x and y. Several convolutional layers are often followed by a max pooling layer, and a CNN generally has one or more FC layers at the output. It is important to note that due to the small filter plane, a convolutional layer reuses the same input data and a relatively small number of weights over many sequential operations. Meanwhile, a fully connected layer typically involves a much larger number of weights with no input data reuse across its VMMs. Therefore, in any hardware implementation, convolutional layers tend to be computation-bound, while fully connected layers are bounded by the memory bandwidth.22
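The loop nest below is a direct, unoptimized evaluation of Eq. (4) in NumPy, included to make the indexing explicit. The feature-map size, filter size, ReLU activation, and absence of zero-padding are assumptions made only for this sketch.

```python
import numpy as np

def conv_layer(I, W, b, S=1):
    """Direct evaluation of Eq. (4) for one convolutional layer.

    I: input feature map, shape (H, W_xy, N_ic)
    W: weights, shape (K_x, K_y, N_ic, N_oc)
    b: biases, shape (N_oc,)
    S: stride
    """
    Kx, Ky, N_ic, N_oc = W.shape
    H_out = (I.shape[0] - Kx) // S + 1
    W_out = (I.shape[1] - Ky) // S + 1
    O = np.zeros((H_out, W_out, N_oc))
    for x in range(H_out):
        for y in range(W_out):
            for m in range(N_oc):
                # dot product over one K_x x K_y x N_ic sliding window
                window = I[S * x : S * x + Kx, S * y : S * y + Ky, :]
                O[x, y, m] = np.maximum(0.0, b[m] + np.sum(window * W[:, :, :, m]))
    return O

rng = np.random.default_rng(2)
I = rng.random((8, 8, 3))                  # toy 8x8 input with 3 channels
W = rng.normal(0.0, 0.1, (3, 3, 3, 16))    # 3x3 filters, 3 input / 16 output channels
O = conv_layer(I, W, np.zeros(16))
print(O.shape)                             # (6, 6, 16) without zero-padding
```

The heavy reuse of the small array W across every (x, y) position is exactly the property that digital dataflow accelerators and, later in this review, crossbar replication schemes try to exploit.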


C. Processing large neural networks

As the machine learning field has advanced, the size of state-of-the-art deep neural networks has grown dramatically. This is readily seen in networks developed for computer vision tasks. While LeNet-5 (Ref. 23)—a classic CNN trained for the MNIST hand-written digit recognition task24—has less than 10^5 parameters, modern CNNs optimized for the ImageNet Large Scale Visual Recognition Challenge25 have on the order of 10^8 to 10^9 weights. Meanwhile, datacenter-targeted neural networks such as Google Brain are several orders of magnitude larger still.26 The large size of state-of-the-art neural networks makes them difficult to train without access to large computational resources, such as distributed computing clusters, and difficult to deploy for inference applications on edge devices (such as mobile phones), which have limited storage, computational capacity, and power budget.

Some progress has been made at the algorithm level to make these neural networks more compact without greatly sacrificing their accuracy. One widely used approach is quantization:27 reducing the bit precision of the weights and activations. Quantization is significant as it reduces not only the size of the network in memory but also the energy and area costs of memory access and arithmetic computations.22 Good performance on image recognition tasks has been shown with neural networks that use only 1-bit (binary) weights28 during inference and with fully binary neural networks using both 1-bit weights and 1-bit neuron activations.29,30 BinaryNet, a state-of-the-art binary neural network, compresses AlexNet—a classic CNN designed for the ImageNet task—by a factor of 189 while suffering only a small top-1 accuracy loss from 56.6% to 51.4%.31 These highly compressed networks still require higher-precision values during backpropagation since the errors and weight updates corresponding to individual training examples are small.32 Besides the compression of weights and activations, there are also efforts to prune weights from the network and in general to deploy less computationally intensive CNNs such as MobileNet, which uses depthwise convolutions to greatly reduce the number of multiply accumulate (MAC) operations.33 One caveat of these approaches is that while they reduce the network's memory footprint and computational load, they do not always produce a proportional reduction in the energy needed to process the network due to the overhead of data movement.34

The growing size and demand for neural networks, particularly in industrial applications, call for hardware innovations in parallel with the algorithmic ones in order to make large, high-performance neural networks available to users and researchers who are constrained by the cost of computation. Broadly speaking, large neural networks cannot be implemented efficiently on general-purpose central processing units (CPUs) in either the inference phase or the training phase. The CPU, which is specialized for executing one (or a few) potentially very complicated instruction at a time, is ill-suited for the needs of large neural networks, which are characterized by a large data volume and highly regular workload built from a small set of computational primitives.

Using graphics processing units (GPUs), which contain hundreds or thousands of co-processors that compute in parallel, significantly improves performance within the domain of neural network processing. The GPU's co-processors share access to a very fast local memory or to a global memory in a highly parallelized (coalesced) manner.35 Many of today's state-of-the-art neural networks are trained using GPUs or distributed GPU clusters,2,22 and GPU-accelerated libraries of the computational primitives in deep neural networks have been released, which facilitate code development for both inference and training.36 However, in spite of the advantages over CPUs, memory transfer remains an overwhelming bottleneck in processing large neural networks on GPUs.2,35 Since both the CPU and the GPU have limited local memory and memory bandwidths, the growing data volume in modern neural networks (as indicated by the sizes of the weights W and activations x) implies that expensive off-chip memory accesses tend to dominate the energy and latency of the computation.5

III. DIGITAL NEUROMORPHIC ARCHITECTURES

To address the shortfalls of CPUs and GPUs for data-centric applications, there has been considerable recent interest in customized, domain-specific architectures for machine learning.1,3,4,26 Here, we will briefly survey these digital neural network accelerators, which include both systems based on field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). We refer the reader to Refs. 3 and 4 for a more complete overview. Many of these solutions aim to accelerate neural network inference, with training still largely performed offline by GPUs.

The FPGA is a popular platform for the development of flexible hardware for various applications, including deep neural networks.37–41 Compared to the GPU, its reconfigurability enables a greater degree of algorithm and hardware co-design, as well as the possibility of lower-precision datapaths that lead to more energy-efficient inference operations. However, the overhead needed to support reconfigurability limits the available processing resources within a chip; multiple FPGAs may be needed to implement a desired function, increasing area and power consumption. These trade-offs can be balanced by using collections of FPGAs interconnected by a high-speed reconfigurable network.42 At the datacenter scale, Microsoft's Project Brainwave uses large pools of FPGAs to deliver inference results with low latency. Brainwave's high performance on large neural networks relies on efficiently partitioning the network across FPGAs such that each partition (or sub-graph) requires only those weights that can be pinned onto the high-bandwidth on-chip FPGA memory.43

Google's Tensor Processing Unit (TPU) is a custom chip that was designed for in-datacenter acceleration of neural networks.22 The TPU v1, which is an inference accelerator, contains at its core a 256 × 256 block of 8-bit multiply accumulate (MAC) units that perform the individual multiplication and addition operations inside the VMM in Eq. (1). The block is operated as a systolic array, as shown in Fig. 3, where the input data are broadcast horizontally and VMM partial sums are accumulated from the top downward. On each clock cycle, each MAC unit receives an input and partial sum from its neighbors, performs a computation, and then forwards the updated partial sum to the next element. This configuration leads to higher compute density (25× more MAC units than the larger Nvidia K80 GPU) and reduces the number of intermediate reads and writes to a buffer during a VMM.

FIG. 3. The systolic array of 8-bit multiply accumulate units in the Google TPU v1.22

Access to off-chip main memory is the most significant bottleneck to large neural network computations. One way to address this issue is to place optimally sized memory banks close to the processing elements for the transfer of intermediate results. DaDianNao, for example, distributes large neural network layers over tiles of processing elements and provides local embedded dynamic random access memory (eDRAM) banks within each tile.44 3D integration of DRAM arrays on top of a layer of digital logic elements, as in Neurocube,45 can also effectively bring processing close to the memory.

An important approach to minimize memory access costs is to adopt a more efficient dataflow that maximally reuses the transferred data. For example, DianNao accommodates large layers, which do not fit in the local memory dedicated to its processing elements, by breaking up the VMM into tiles that minimize the transfer of input and output activations.46 In CNNs, the heavy reuse of filter weights and feature maps provides a greater opportunity to amortize the cost of data access over many operations. Eyeriss exploits this via a careful mapping of the convolution layer to the dataflow within the chip.47 Like the TPU, Eyeriss executes its operations on a systolic array but with a reconfigurable row-stationary dataflow that minimizes the movement of both filter weights and input data. For smaller filters, different parts of the array can be used to simultaneously process different parts of a convolution operation. Reference 3 provides a comprehensive overview of energy-efficient dataflows used by various digital architectures.

Inference accelerators have also been developed to operate on neural networks with low-precision data types and/or sparse weight matrices to save energy and area.48,49 In the limit of quantization to a single bit, YodaNN accelerates binary weight networks,50 the Unified Neural Processing Unit (UNPU) supports variable-precision weights (1 to 16 bits),51 and BRein Memory accelerates networks with both binary weights and activations.52 Specialized digital architectures have also been developed for deep neural network training—one example is ScaleDeep, which is designed as a server node architecture.53 Cambricon, an instruction set architecture (ISA) specific to neural network accelerators, increases code density while extending support to a variety of techniques used in deep learning, for both inference and training.54
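The partial-sum accumulation in a weight-stationary systolic array can be mimicked in a few lines of NumPy. The sketch below is a heavily simplified behavioral model in the spirit of Fig. 3, not the TPU's actual microarchitecture: it ignores the pipelined skewing of inputs and the 8-bit quantization, and the array size and operands are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4
W = rng.integers(-8, 8, (N, N)).astype(float)   # stationary weights, one per MAC cell
x = rng.integers(0, 4, N).astype(float)          # input activations fed in from the side

psum = np.zeros(N)          # partial sums flowing down the columns
for i in range(N):          # step i: row i's input reaches the cells in that row
    # each cell (i, j) multiplies x[i] by its stored weight and adds the result
    # to the partial sum arriving from the cell above it
    psum += x[i] * W[i, :]

print(np.allclose(psum, x @ W))   # True: the array has computed the VMM
```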


A distinct class of digital accelerators overcomes the memory transfer bottleneck by directly performing computation inside a modified digital memory array, exploiting the high internal bandwidth of DRAM and static random access memory (SRAM) technology. Ambit, an example of a digital processing-in-memory (PIM) architecture, implements a logically complete set of operations within a DRAM array with minimal additional area overhead.55 It uses the simultaneous activation of three rows in a DRAM array to perform parallelized AND and OR operations on two rows and adds a row of dual-connected 2T-1C cells to implement a row-wise NOT operation, as shown in Fig. 4. To avoid overwriting the operand data, data must first be copied onto these designated rows for bitwise logic, which adds latency. DRISA (DRAM-based Reconfigurable In-Situ Accelerator) considers completing Ambit's AND/OR operations with simple CMOS logic gates (e.g., NOT) integrated with the column sense circuitry, as well as arrays that can perform logically complete NOR operations using 3T-1C DRAM cells.56

FIG. 4. (a) Layout of a DRAM array. (b) In Ambit,55 three designated rows can be simultaneously activated to perform AND or OR operations on rows A and B, depending on the values on row C. A row of dual-connected 2T-1C cells can be used to perform a row-wise NOT operation. WL = wordline and BL = bitline.

While the bitwise operations in Ambit and DRISA are logically complete, many cycles are needed to execute multi-bit arithmetic operations—hundreds of cycles, for example, to compute multiplications with three bits or more.56 Consequently, high performance must be achieved through parallelism, not only via the simultaneous activation of multiple rows but also of multiple DRAM subarrays. More importantly, PIM architectures work most efficiently as inference accelerators for networks where either the weights or activations are binary. Deep neural networks based on bulk bitwise operations can also be implemented in SRAM arrays57–59 and in emerging non-volatile memories such as resistive RAM, phase change memory (PCM), and spin-transfer torque (STT) magnetic RAM when operated in binary storage mode.60

Digital processing-in-memory architectures share some similarities and intrinsic advantages with the analog non-volatile-memory-based architectures that we consider in the remainder of this paper. We refer the reader to Ref. 61 for a detailed neuromorphic performance comparison of the two architecture types.

IV. ANALOG IN-MEMORY ACCELERATION OF NEURAL NETWORKS

By embedding neural network computations directly inside the memory elements that store the weights, analog neuromorphic accelerators based on non-volatile memory (NVM) arrays can greatly reduce the energy and latency costs associated with data movement. In this section, we will also see that computation within these arrays is inherently parallel, and this can be leveraged to accelerate the neural network primitives discussed in Sec. II—the VMM, the MVM, and the outer product update—enabling efficient architectures for both inference and training.

Figure 5 shows the basic structure of a resistive memory array, realizable with various non-volatile memory (NVM) device types, whose conventional application is high-density data storage. To perform a VMM within the array, all the rows (or wordlines) are activated simultaneously, with a voltage V_i applied on row i. The total current collected by the jth column (or bitline), which we assume to be held at a fixed potential, is given by

I_j = Σ_{i=0}^{N_r−1} G_{ij} V_i,   0 ≤ j ≤ N_c − 1,   (5)

where G_ij is the conductance of the memory element at the array position (i, j), N_r is the number of rows, and N_c is the number of columns. The above expression implements a vector dot product, where the multiplications are realized by Ohm's law and the summation by Kirchhoff's current law. Since the currents flow through all the columns in parallel, the crossbar executes the full VMM in a single operation. The bias b can be added as an extra row in the array. The MVM operation, which uses the transpose of the same weight matrix, can likewise be executed by driving the columns and reading the currents on the rows. We defer discussion of the parallel outer product update to Sec. VII on training accelerators.

FIG. 5. The basic concept of a massively parallel analog vector–matrix multiplication within a resistive memory crossbar. The currents accumulated by the columns are given by Eq. (5).

The physical latency of a crossbar VMM is determined by the time required to charge each row of the array. Since both the resistance and the capacitance of the row increase with the number of columns, the latency scales as O(N²) for an N × N array. Nonetheless, for metal interconnects used in the 14 nm CMOS process, this RC time is only about 0.2 ns for a very large array with N = 1024, which is orders of magnitude faster than the time taken to compute the same VMM (10^6 MACs) on a digital processor.62 This RC delay is sometimes treated as constant time in practice, as it tends to be much smaller than the latency of the peripheral circuits, such as the analog-to-digital converter (ADC) and the digital-to-analog converter (DAC).
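A behavioral model of Eq. (5) is easy to write down and is a common starting point for architecture-level studies. In the NumPy sketch below, the conductance range, array size, and read voltages are illustrative assumptions only, and all device and circuit non-idealities are ignored.

```python
import numpy as np

rng = np.random.default_rng(4)
Nr, Nc = 64, 32
G_min, G_max = 1e-6, 1e-4                  # conductance range in siemens (illustrative)
G = rng.uniform(G_min, G_max, (Nr, Nc))    # programmed conductances G_ij
V = rng.uniform(0.0, 0.2, Nr)              # read voltages applied to the rows

# Eq. (5): every column j collects sum_i G_ij * V_i in a single parallel operation
# (Ohm's law does the multiplications, Kirchhoff's current law does the sum)
I_cols = G.T @ V                           # column currents, shape (Nc,)

# The MVM with the transposed matrix reuses the same array: drive the columns
# with voltages and sense the currents on the rows instead.
V_cols = rng.uniform(0.0, 0.2, Nc)
I_rows = G @ V_cols
```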


Compared to the digital PIM accelerators discussed in Sec. III, the resistive crossbar has a more favorable energy scaling. Computing a VMM on a DRAM or SRAM array requires charging one row at a time and then one column at a time for each cell. By charging all the rows in parallel, the energy of a VMM scales as O(N²), compared to O(N³) for a digital memory array.63 Resistive crossbars, particularly those based on ReRAM technology, can also potentially be denser than SRAM or DRAM arrays, while being capable of storing multiple bits of weight data per cell. It is often the case in analog accelerators, however, that while the in-memory matrix computations are highly efficient, the area and energy consumption is dominated not by the crossbar but by the peripheral circuitry.62

Comparing Eq. (5) to Eq. (1), we find that the conductance G_ij is simply proportional to the represented weight W_ij. Realistically, the mapping from the weight matrix to the array conductances must account for non-ideal effects at the device and array level, such as the tunable conductance range, I–V nonlinearity, influence of peripheral circuitry, parasitic voltage drops across the array, and device-to-device process variations. An optimal mapping can be extracted from device models or from a few measured parameters.64 It is also possible to calibrate for the combined effects of these non-idealities by streaming a representative set of input data to the crossbar and minimizing the error in the outputs.65 Since conductances must be positive, Eq. (5) is—strictly speaking—compatible only with positive-valued weights. Real-valued weights are typically implemented using two resistors per weight whose currents are subtracted, but various other methods have been explored. We will revisit the topic of signed computation in Sec. VI.

Similar to a fully connected layer, all the weights within a long short-term memory (LSTM) layer can be mapped onto a single crossbar. By driving the rows with the input vectors and the hidden state vectors, all matrix computations within a time step can be executed in one array read operation.66,67 The column outputs can then be digitized and processed by digital circuits that implement the LSTM activation functions; in the next time step, these results can be passed into the same array as the hidden state input. This architecture can support other types of recurrent neural networks, such as gated recurrent units, by modifying or reprogramming the digital circuit blocks.66

Convolutional layers can be mapped onto resistive crossbars by unrolling the input to each sliding window computation into a vector and decomposing the convolution into VMMs. The most widely used scheme, proposed in the ISAAC architecture,68 is shown in Fig. 6. The unrolled sliding window input is applied to the rows, and the result in each output channel is collected at the columns. While individual sliding window computations are intrinsically parallelized by the crossbar, the full convolution still requires a large sequence of sliding windows that can be processed concurrently only by replicating the weights across multiple devices.

FIG. 6. Mapping a convolutional layer onto a resistive crossbar. Each column contains the full set of weights for one output channel. The size of the array is (K_x × K_y × N_ic + 1) × N_oc.
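The sketch below illustrates this unrolling scheme in NumPy: the filters for all output channels are flattened into a (K_x × K_y × N_ic + 1) × N_oc matrix whose extra row holds the bias, and each sliding-window position becomes one VMM, as in Fig. 6. The sizes and random data are placeholders, and the conductance mapping and signed-weight handling discussed elsewhere in this review are omitted.

```python
import numpy as np

def unroll_window(I, x, y, Kx, Ky, S=1):
    # Flatten one sliding-window patch and append a constant 1 for the bias row
    patch = I[S * x : S * x + Kx, S * y : S * y + Ky, :].ravel()
    return np.concatenate([patch, [1.0]])

rng = np.random.default_rng(5)
Kx, Ky, N_ic, N_oc = 3, 3, 3, 16
I = rng.random((8, 8, N_ic))

# Crossbar of size (Kx*Ky*N_ic + 1) x N_oc, as in the Fig. 6 caption:
# each column stores the flattened filter (plus bias) for one output channel.
W = rng.normal(0.0, 0.1, (Kx, Ky, N_ic, N_oc))
b = rng.normal(0.0, 0.1, N_oc)
crossbar = np.vstack([W.reshape(Kx * Ky * N_ic, N_oc), b[None, :]])

# One sliding-window step = one crossbar VMM; the full convolution repeats
# this over every (x, y) position of the output feature map.
out_xy = unroll_window(I, 2, 4, Kx, Ky) @ crossbar
print(out_xy.shape)   # (16,) outputs, one per output channel at this position
```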


A. Candidate memory devices

1. Device requirements for inference and training

To be useful for neuromorphic computing, non-volatile memory devices must meet a number of requirements that are considerably more stringent than those for storage-class memory,69 particularly if these devices are to be used for training. With this being the case, many different types of memory have nonetheless been proposed as synaptic weights for neuromorphic computing. Here, we provide an overview of the device requirements and the main candidate technologies. We will keep our discussion relatively brief, as detailed device-focused reviews can be found elsewhere.6,7,14,16

For inference-only applications, the device must have at least two reliably distinguishable conductance states. Though not strictly necessary, memory elements with low cycle-to-cycle variability or noise in their stored weight value are desired in order to realize multiple bits of weight information in a single device. Provided that the read non-idealities, depicted in Fig. 7(a), are small, multi-bit devices can greatly enhance the accelerator's area and energy efficiency. The device must also have long retention: any drift in the stored state, caused by charge leakage, ion migration, structural relaxation, or other mechanism, should be absent or should occur over a sufficiently long timescale. An auxiliary but nonetheless important side-effect of drift is read noise that increases over time as a result of variability in the rates of drift. Low conductance is also strongly desired to limit parasitic voltage drops along the interconnects, thus enabling larger VMMs within a crossbar.70 Cycle-to-cycle noise, inaccurate weight programming, and drift can all accumulate to pose a serious challenge to the signal-to-noise ratio of VMM operations in large crossbar arrays. A recent resilience analysis of state-of-the-art convolutional neural networks—such as AlexNet, VGG, and ResNets—demonstrates the varying degrees to which these networks can tolerate weight and activation noise introduced by the physical hardware implementation during inference.71

FIG. 7. (a) For reliable and accurate inference, memory devices with low read noise and small drift are desired. For reliable and accurate training, devices must have both low read and write non-idealities. (b) The response of a memory device, with and without non-idealities, to a sequence of identical positive update pulses, followed by a sequence of identical negative updates of the same magnitude.

To accelerate training, which involves more precise weight tuning, more analog conductance states are needed; this requires devices with small variability in both their read and write operations. The device should have high endurance and low write latency so that it can be trained reliably and efficiently over many examples in a large dataset and should also have low write energy. Though the memory technologies considered in Table I can each be switched with picojoule write energies, the write voltage and write current are also individually important: a low write voltage is desired to minimize the CV² energy needed to charge the array, and a low write current helps reduce write errors incurred due to array parasitic resistance.

To train accurately, the memory device should further have a gradual and linear response to the programming pulses used to update the weight, i.e., the same magnitude of conductance change for the same strength of the pulse, regardless of the initial state—this is a demanding but essential requirement.6 Examples of nonlinear deviations from this behavior are shown in Fig. 7(b). In general, an asymmetric nonlinearity—a different response to positive and negative update pulses—causes a greater accuracy degradation than a symmetric nonlinearity. In the presence of asymmetric nonlinearity, an attempt to fine tune the weights via alternating positive and negative updates will tend to drive the weight value to zero.86 The degree to which deviations from this ideal behavior affect neural network training has been explored in detail by several papers.86–88 It is possible to compensate for some device shortcomings—particularly low precision and asymmetry—at the architecture level, as we will review in Sec. VII. Mitigation strategies at the architecture and algorithm level have also been proposed for the other non-idealities, which we will survey in Sec. VIII.

2. Emerging non-volatile memories

Table I provides a coarse comparison of the non-volatile memory technologies that are good candidates for analog synapses. We provide a brief overview of these device options below.

Resistive RAM (ReRAM) is a popular choice for crossbar accelerators, as it is generally highly scalable (with a footprint that is potentially limited only by the metal pitch62), back-end-of-line compatible, and can have a continuously variable conductance, low write energy, low write latency, and high endurance.6 ReRAM devices are two-terminal structures with a variable-conductance layer that is typically a metal oxide. The conductance responds to electrical pulses either through the formation and destruction of a conductive filament89,90 or through the electric field-driven migration of oxygen vacancies through the oxide volume.91,92 Since these changes involve atomic displacements, ReRAM devices tend to suffer from high cycle-to-cycle write noise, high device-to-device variability, relatively high read conductance, and random telegraph noise (RTN) in the stored weight value due to electron trapping, which more strongly affects the high-resistance state than the low-resistance state.14 Several works have considered scaling dense crosspoint ReRAM arrays to the third dimension to extend the size of the neural network that can be mapped to the array93–95 and to increase the number of synaptic bit slices to enable greater weight precision.96

Phase change memory (PCM) uses the energy delivered by a current pulse to cause a phase transition of a chalcogenide glass from a high-resistance amorphous phase to a low-resistance crystalline phase. Different regions of the device volume can be locally amorphous or crystalline, enabling many intermediate conductance levels. However, while gradual resistance reductions have been demonstrated (SET), increasing the resistance (RESET) requires melting and quenching the entire volume of material and is therefore abrupt6—this shortcoming can be mitigated at the array level, as we will discuss in Sec. VII. Compared to ReRAM, PCM devices require a larger programming current76 and are susceptible to drift due to structural relaxation, particularly in the amorphous phase.97
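The write non-idealities sketched in Fig. 7(b) can be captured with a simple behavioral model. The code below uses an exponentially saturating step size, which is one common way such nonlinearity is parameterized; the conductance bounds and step factors are arbitrary assumptions, and real devices need not follow this form.

```python
import numpy as np

G_min, G_max = 1e-6, 1e-4          # conductance bounds in siemens (illustrative)
alpha_p, alpha_n = 0.05, 0.15      # unequal step factors -> asymmetric nonlinearity

def potentiate(G):
    # positive update pulse: the step shrinks as G approaches G_max
    return G + alpha_p * (G_max - G)

def depress(G):
    # negative update pulse: the step shrinks as G approaches G_min
    return G - alpha_n * (G - G_min)

G = G_min
trace = []
for _ in range(50):                # 50 identical positive pulses...
    G = potentiate(G); trace.append(G)
for _ in range(50):                # ...followed by 50 identical negative pulses
    G = depress(G); trace.append(G)

print(f"after +50/-50 pulses: G = {G:.2e} S (started at {G_min:.2e} S)")
# Because alpha_p != alpha_n, alternating +/- pulse pairs do not cancel; this is
# the asymmetry that tends to corrupt fine weight tuning during in situ training.
```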


TABLE I. Comparison of presently available non-volatile analog memory technologies.

Property | Resistive RAM | Phase change memory | Floating gate/charge trap memory | Redox transistor | Ferroelectric FET
Maximum resistance | 1 MΩ | 1 MΩ | 1 GΩ | 10 MΩ [70,72] | 1 GΩ
Device area [73] | 4F² | 4F² | 4–10F² | Large | 4F² [74]
Endurance^a | 10^12 cycles [75] | 10^12 cycles [76] | 10^5 cycles | >10^9 cycles [72] | 10^9 cycles [74]
Programmable resolution^b | 8 bits [77] | 5 bits [78] | 7 bits [79,80] | 9 bits [81] | 5 bits [82]
Write current | 1 µA [83] | 100 µA [76] | 1 pA^c [84] | 10 nA [72] | …
Write speed | ns | ns | ms | ms [70], µs [72] | ns
Update stochasticity | High | High | Low | Low | Moderate
Update linearity | Poor | Poor | Moderate | Good | Poor
Symmetric update? | No | No | Variable^d | Yes | No

^a These values indicate the number of digital write-erase cycles between the maximum and minimum conductance states. Analog synaptic updates can be much smaller, potentially leading to higher endurance and lower write latency/energy for neuromorphic applications.
^b The values correspond to state-of-the-art programming precision in single devices, which may require the use of iterative write-verify schemes. Achieving the same resolution for every device in an array is more challenging.
^c Refers to programming by Fowler–Nordheim tunneling, which uses very low current but a relatively high voltage (10 V).
^d The gate bias can be used to control whether the nonlinearity is asymmetric or close to symmetric.85

Electrochemical devices have recently been demonstrated, which implement highly linear and symmetric synapses with a large number of finely spaced states, which are ideal properties for training.81,98 By applying voltage pulses of the appropriate polarity to the gate of a redox transistor structure, mobile ions can be injected or removed from the channel, modulating its carrier density and hence its conductance. Low channel conductance (100 nS) and sub-microsecond write times have recently been demonstrated in redox transistors integrated with a ReRAM selector to enable long state retention.72 A remaining challenge with these devices is the difficulty of co-integration with CMOS technology due to the added fabrication complexity and the temperature incompatibility of polymer device processing with back-end-of-line forming anneal steps.99 In spite of this, their promising electrical properties may pave the way toward modified fabrication steps to deal with these limitations or novel polymer and ionic blends that can accommodate modern CMOS process flows.

A non-volatile memory with some similarities to flash memory is the ferroelectric field-effect transistor (FeFET), which allows threshold-voltage programming by integrating a ferroelectric layer within the MOS gate stack. The degree of electric polarization of the ferroelectric layer can be controlled using the electric field, provided by short voltage pulses with very small current. While a symmetric update response with multiple conductance states has been demonstrated, training is complicated by the fact that different ferroelectric domains within the layer require different switching voltages, which necessitates a more complicated write scheme with variable-amplitude or variable-length pulses.82,100 FeFETs are potentially more attractive as VMM engines for inference101,102 but may be practically limited in this case by their comparatively short retention, which arises from charge leakage through the gate and the presence of a parasitic depolarization field.103

Magnetic tunnel junction (MTJ) based memories, including spin-transfer torque magnetic random access memory (STT-MRAM), are well suited for storage due to their high density, high endurance, low write energy, and fast write speeds. However, because STT-MRAM traditionally holds only two magnetoresistance states, it can be used only as a digital or binary synapse rather than an analog one. More recently, multi-level storage has been demonstrated in magnetic devices that use spin-transfer torque to modulate the position of a magnetic domain wall, which can then be read out electrically using an MTJ.104 The challenge remains, however, of a low magnetoresistance on/off ratio, which reduces the effective bit resolution in the presence of noise.

3. Floating-gate and charge-trap memory

The floating-gate transistor108 and the closely related charge-trap memory,109 which are the basis for commercial flash memory, are also attractive options for neuromorphic computing owing to their mature fabrication technology. In these devices, the analog synaptic weight is stored as charge that resides on an electrically isolated internal electrode or within trap states in an insulator. The charge is added or removed via Fowler–Nordheim tunneling or hot-electron injection through the surrounding oxide. Their advantages lie in their low variability, multiple bits per cell, and their access to a subthreshold regime of operation with very low conductance, which enables scaling to larger crossbars. However, for in situ training, a significant drawback lies in the large voltage (for tunneling), large current (for hot-electron injection), and long write times incurred by the programming pulses. There are several options for integrating floating-gate devices in a dense array for VMM acceleration, which we summarize below.

Floating-gate devices can be integrated into a crossbar much like two-terminal variable resistors as shown in Fig. 8(a), with the input voltage applied across the source and drain terminals to execute a VMM through Ohm's law.85,110,111 This configuration allows the device to be operated in various regimes of operation—subthreshold, triode, or saturation—depending on the gate voltage applied during read. In training accelerators, this scheme is also compatible with the parallel outer product update, which we describe in Sec. VII.85

Alternatively, by operating the transistors as programmable current mirrors as shown in Fig. 8(b), the inputs to a VMM can be applied via the gate terminal.105,112–114 In the subthreshold regime, the weight can be implemented as a scaling factor between two currents, rather than through Ohm's law,

W_ij = I_ij / I_i = exp( κ (V_fg,ij − V_fg,ref) / (kT/q) ),   (6)


where I_ij is the drain current through the floating-gate synapse, I_i is the input current on row i that is injected through a reference transistor, V_fg is the voltage on the floating gate, κ is the gate efficiency, T is the temperature, k is the Boltzmann constant, and q is the electron charge. The transmission of a gate voltage across the rows requires a minimal amount of input current, reducing the susceptibility of the input signal to parasitic voltage drops, but the device drain currents must still flow across both the rows and columns. The temperature dependence in Eq. (6) can be compensated using external circuitry.105,113 The gate-coupled floating-gate configuration has been used to demonstrate moderately large crossbar inference systems for MNIST image classification.106,115 The one-transistor-per-cell configuration has also been shown to be programmable with minimal write disturb to neighboring cells,80 potentially enabling training in a dense array.116

FIG. 8. VMM using a 1T array of floating-gate devices operated (a) as programmable resistors85 and (b) as gate-coupled programmable current mirrors.105 In the latter case, a voltage input must first be converted into a current; fully current-mode operation can also be used.106,107 The floating gate voltage on the row drivers is set to a uniform reference, V_fg,ref.

Three-dimensional NAND flash, used for high-volume data storage, can also be used for in-memory VMM computation with large neural network models. In such an architecture, floating-gate devices are serially connected in vertical pillars, and VMMs can be executed one 2D slice at a time. To avoid undesired sneak paths, the devices in the other layers are left unselected by an appropriate voltage biasing scheme.110,111,117 Compared to a standard 2D array, each 2D slice within a 3D array must contend with greater disturbance from parasitic resistances and capacitances.

4. VMM with volatile capacitive memories

Some early analog accelerators proposed crossbar-based parallel VMMs using volatile capacitive memories. For instance, Ref. 118 stores a 6-bit analog weight in the charge on a capacitor, whose product with a 3-bit input is computed with a multiplying digital-to-analog converter (MDAC) circuit; the currents are then summed using Kirchhoff's law. In Ref. 119, the weights are stored in binary DRAM cells. When a binary input signal is applied to a row, charge is transferred from each cell to a MOS capacitor whose gate is connected to a column line. The total amount of transferred charge among the devices in a column can be capacitively sensed as a voltage change on the column line, which is digitized by an ADC. All rows can be activated simultaneously to realize a parallel VMM. Afterwards, the charge can be returned to the DRAM cell to restore the weight. To perform higher precision computations, this work proposed bit slicing of both the inputs and the weights, as we will define in Secs. V and VI, respectively.

References 120 and 121 also store an analog weight as the charge on a trench capacitor, which in turn controls the resistance of a readout CMOS transistor operated in triode mode. The scheme requires several additional transistors per unit cell to charge and discharge the capacitor, offsetting the area efficiency of the crossbar. In general, accelerators based on volatile memories suffer from low retention and potentially require charge refresh circuits at the array crosspoints, leading to high footprint.118 This makes them impractical or expensive for inference and greatly constrains their use as training accelerators.

Some of the more recent mixed-signal accelerators fully or partially execute the VMM in the analog domain using capacitive circuits but do not employ the concept of a computational memory crossbar. Ref. 122 uses strictly binary inputs and weights, which allows compact efficient multiplications in the digital domain but computes the sum of these binary products using an array of switched capacitors rather than a digital adder tree. In Ref. 123, switched capacitors integrated into a successive-approximation-register (SAR) ADC perform the analog VMMs, achieving a high energy efficiency of 13 fJ per multiplication using 3-bit weights and 6-bit activations.

B. Access devices

To ensure that the conductance states of the memory elements remain truly non-volatile during programming, the crossbar in Fig. 5
input as a voltage on a high-impedance access gate (i.e., on the select


line in Fig. 9), the signal can remain free from distortion caused by
parasitic resistance. The nonlinear characteristic of the transistor
implies, however, that this voltage can practically only be binary (bias
on or off). To communicate values with multiple bits of precision, the
input can be transmitted one bit at a time to the array, or the input
value can be encoded in the width of the input pulse, which modulates
how long the access transistor remains turned on. We will review both
of these approaches in Sec. V.

C. Limitations on crossbar size


For neuromorphic accelerators, an important architectural ques-
tion is: what is the largest realistic memory crossbar that can be used?
The numerical answer depends on the memory and access devices
that make up the crossbar and is in general determined by two princi-
pal considerations:
(1) Parasitic resistances in the crossbar: in large crossbars, the resis-
tive voltage drop across one full row or column can be signifi-
cant. This effectively reduces both the fidelity of the input
signal seen at the array element and the fidelity of the weight
FIG. 9. Resistive crossbar with one access transistor per cell (1T1R). To select a value seen at the output, and this effect is nonuniform across
device, its select line (SL, blue) is raised to a high voltage and its bitline (BL) is the array. It is possible to partially compensate for this effect
held to a low voltage. using strategies like nonlinear weight mapping, which we will
discuss in Sec. VIII. However, at some point, compensation
becomes impossible without increasing the input voltage; this
is typically augmented with access devices (Fig. 9), as required in
sets a limit on the crossbar size and strongly prioritizes non-
arrays used for storage-class memory. These devices are characterized
volatile synaptic devices with intrinsically lower conductances
by a highly nonlinear I–V relationship and serve as a barrier to current
to enable larger arrays.
conduction when the applied voltage is low but are transparent when
(2) The on/off ratio of access devices: if this ratio is insufficient, then
the voltage exceeds a threshold. Their primary purpose is to ensure
the crossbar suffers from large power dissipation during pro-
that when both the wordline and bitline (row and column) corre-
gramming and potentially also from write disturb. A reasonable
sponding to a selected device are activated during write, the other devi-
limit on crossbar size is the point when the total leakage in all
ces in the array see a voltage that is below the threshold of the access
the unselected and half-selected devices matches the current
device and pass only a minimal current—this is particularly important
flowing through the selected device. For example, with an On/
for half-selected devices, which share a row or column with the
Off ratio of 106, the limit on crossbar size is roughly 1000 
selected device. Failure to suppress current flow in these devices can
1000.126
lead to a disturbance of the stored state. Leakage currents accumulated
over the array can also dissipate significant power and cause voltage Other considerations may become important limiters of crossbar
drops across the rows and columns that can lead to programming size by degrading the accuracy of the neural network. For example, the
errors. The access device is typically a CMOS transistor, though highly effects of device-to-device variability and noise on the VMM result
nonlinear ReRAM devices have also been engineered to provide fast, generally increase with the number of rows or columns.88 For emerg-
accurate select operations.124,125 We refer to Ref. 126 for a review of ing memory devices, yield is also a practical concern. Many of the
access device design considerations and available technologies. experimental demonstrations of fabricated crossbars127–132 have not
3The access device converts the passive crossbars based on resistive been large enough to be severely constrained by these issues. To date,
two-terminal devices into 1T1R (1 transistor, 1 resistor) arrays, with a inference on the MNIST task has been demonstrated on several large-
third terminal to control the access device during writes. Some configu- scale crossbars: a 512  1024 binary ReRAM array,133 a 500  661
rations based on three-terminal devices, such as the floating-gate transis- PCM array,134 and a 785  128 floating-gate array.115
tors discussed previously, may be less sensitive to write disturb85 or may
not need any additional access devices if the biasing scheme during pro- V. PERIPHERAL CIRCUITS IN ANALOG ACCELERATORS
gramming can be optimized.80 A passive 12  12 ReRAM crossbar has The peripheral circuits accompanying the memory crossbar typi-
been experimentally demonstrated, which leverages the I–V nonlinearity cally comprise a large share of the energy consumption and area of an
of the memory device itself to avoid the use of access devices.127 analog neuromorphic accelerator, and in many cases, they also domi-
However, this approach would add to the already long list of require- nate the latency.62,68 Of these components, the process of converting
ments that the memory devices must meet, and the level of nonlinearity between analog and digital signals typically carries the largest energy
in the device is likely insufficient for crossbars of larger size. overhead. Since the analog accelerator interfaces with a digital proces-
Access transistors can also be used to more efficiently pass VMM sor at its input and output, these conversion steps are always necessary
inputs to the memory elements during inference. By applying the at the edges of the system (unless the input is received in the analog

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-11


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

domain directly from a sensor, as in Ref. 135). Communication Some papers circumvent the need for an ADC by representing
between crossbars, however, can be carried out in the analog or digital data as a pulse with fixed amplitude but variable duration.137,141 By
domain. encoding the signal in the time domain, this approach provides some
immunity to noise and distortion that primarily affects the signal
A. Analog vs digital routing amplitude. We will look more closely at this scheme in Sec. V B 2.
Most architectures opt for digital routing between crossbars, as it
allows the use of dense and efficient on-chip digital interconnection B. Driving the crossbar inputs
networks138 and because analog signals are prone to degradation by One of the primary differentiators between analog architectures
noise and distortion. As a case study, HP’s Dot Product Engine is the way in which they represent the input signals to a VMM opera-
(128  128 crossbar) was evaluated in simulation using both analog tion in order to drive the rows of a memory crossbar. We review sev-
routing—with buffers and repeaters—and digital routing, which inter- eral categories of input representations below.
faces with the crossbars via ADCs and DACs.65 The authors found
that accuracy on the MNIST task degraded more strongly with analog 1. Analog voltage levels
routing compared to digital routing with just four bits of resolution in
the ADCs/DACs. In the analog approach, errors in the memory array The most conceptually direct way of implementing the VMM in
due to noise, nonlinearity, and variability accumulate from layer to Eq. (5) is to encode an element xi of the input in the amplitude of a
layer, whereas the quantization step in the ADC effectively limits the voltage signal Vi, as shown in Fig. 10(a).65,129,136 For an input with Bin
propagation of these errors between layers. In RENO, the authors take bits of precision, this method requires a Bin -bit DAC to supply the 2Bin
advantage of the short distance between crossbars ( 0.5 mm) to possible analog voltage levels to a crossbar row or a ðBin  1Þ-bit DAC
transmit data in analog form though a network of analog sample and if one of the bits determines the sign of the voltage pulse. The advan-
hold (S&H) buffers and switches.139,140 Accounting for circuit noise tage of this method is that the full VMM can be completed in a single
and device process variations, they simulate a classification accuracy crossbar read operation, and the latency does not scale with the preci-
loss of several percent for a few small tasks (using one to two hidden sion of the input. However, the area and energy consumption of the
layers) relative to a comparable system with fully digital processing DAC could scale exponentially with Bin . While the overhead of a single
and routing. The higher accuracy of the digital approach came at the DAC scales similarly to that of an ADC, the DACs cannot be shared
cost of a lower energy efficiency and higher latency.139 across the inputs (which must be driven simultaneously) in the same

FIG. 10. Four different schemes for representing the crossbar input signal and the associated peripheral circuitry in the signal path: (a) voltage amplitude encoding,65,136 (b)
analog temporal encoding,134,137 (c) digital temporal encoding,62 and (d) input bit slicing.12,68 The transimpedance amplifier (TIA) can be replaced by a different current-to-volt-
age converter, such as a sample-and-hold circuit.

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-12


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

way that a high-precision ADC can be shared or multiplexed over the reminiscent of a dual-slope ramp ADC147 but without the overhead of
outputs (as we discuss in Sec. V C). The output currents on the col- a digital counter.
umns can be converted into a voltage using a transimpedance ampli-
fier (TIA),65 a sample-and-hold or integrator circuit, or with similar 3. Digital temporal encoding
compact sensing circuitry.142,143 The analog voltage output is then
converted to a digital signal using an ADC and then transmitted to the A variation of the temporal encoding approach maintains a digi-
next stage. tal output, allowing for digital processing and signal routing between
A significant drawback of this approach is the need for the memory arrays, and encodes the signal in the time domain only within the
elements to behave as ideal resistors with a highly linear I–V curve over crossbar computational core. At the input of an array, a digital logic
the range of possible Vi values; without this, the VMM is distorted and circuit can be used to convert an incoming digital signal into a voltage
pulse train, with one pulse per input bit.62 For each bit position b, the
becomes a nonlinear operation.6 Imposing I–V linearity on the already
pulse has a duration proportional to 2b , as shown in Fig. 10(c), and the
long list of device requirements for the non-volatile memory element
entire pulse train can be made negative to implement negative inputs.
introduces a device/material engineering challenge, which can be circum-
The column currents produced by this pulse train are accumulated on
vented by adopting one of the alternative input representations below.
an integrating capacitor and then digitized by an ADC. To maintain a
fixed potential on the column line independent of the accumulated
2. Analog temporal encoding charge, Ref. 62 uses a current conveyor integrator that provides a vir-
To eliminate the requirement of I–V linearity, each row can be tual ground at its input. In this approach, the ADC only needs to be
driven with a pulse having fixed amplitude (6V0 ) but variable dura- run once for all the input bits.
tion,134 as shown in Fig. 10(b). The input xi is encoded in the pulse In both the temporal encoding schemes, the length of the temporal
duration Ti, converting the resistive crossbar VMM equation to signal scales exponentially with the number of bits, making the approach
prohibitively slow when using high-precision inputs. However, if the
X
Nr 1
ADC delay is comparable or longer and VMM operations can be run in
Ij ¼ V0 Si Gij Ti ; (7) batch (e.g., in convolutional layers), the crossbar read latency can be hid-
i¼0 den by pipelining the analog VMM and ADC stages.68
where Si ¼ 61 is the sign of the pulse. If the pulse is strictly positive,
it can be applied to the gate of an access transistor in series with the 4. Input bit slicing
synaptic memory device. To implement both positive and negative
To avoid the use of analog voltages while maintaining a read
inputs, a set of switches on each row can be used to connect the unit
latency that is linear with the input resolution, a digital input xi can be
cells to a voltage of the desired polarity.144 The input can also be
passed one bit at a time into the crossbar using binary voltage pulses
applied directly to the gates of a floating-gate synapse cell, requiring
of fixed length. This scheme is shown in Fig. 10(d). The output of the
four devices per synapse to implement both real-valued inputs and
VMM with input bit slicing is found by
real-valued weights.105,112,137
!
Aside from the relaxed requirement on device I–V linearity, a bene- BX
in 1 X
N r 1
b ðbÞ
fit of the temporal approach is that it encodes the activation data in a Yj ¼ 2 Gij Vi ; (8)
non-digital form that is insensitive to voltage drops incurred in commu- b¼0 i¼0
nication between arrays, therefore bypassing the conventional analog-to- where
ðbÞ
Vi 2 f0; V0 g is the binary voltage pulse amplitude corre-
digital conversion step and potentially requiring no DAC at the input. sponding to the bth bit of the input xi. This VMM operation requires
However, the information remains susceptible to distortion of the pulse Bin sequential crossbar read iterations (inner sum), each of which
shape. This approach still requires precisely timed ADC-like circuitry to requires an ADC step. If bits are presented from the lowest to the high-
convert the voltage- or current-encoded VMM outputs from the cross- est significance, the exponential in Eq. (7) can be implemented by
bar into temporally coded signals. Since these circuits have finite tempo- shifting the digitized crossbar output one position to the right prior to
ral resolution, the pulse duration (and thus the VMM latency) scales adding the output for the next bit. Thus, the exponentially weighted
exponentially with the effective number of input bits represented. If any outer sum can be implemented using a digital shift-and-add circuit at
intermediate digital processing must occur between arrays, the temporal the periphery of the crossbar.12,68
signal must first be converted to a digital signal by an ADC. An advantage of this approach is that the 1-bit input signal does
In Refs. 141 and 145, the crossbar outputs are converted into not require a sophisticated DAC and in some cases can simply be
temporal signals by continuously comparing the column voltages to a applied to the gate of a transistor that enables or disables current flow
linearly ramped voltage signal: this has a similar overhead to a on the entire row.12 Positive and negative inputs can be realized without
column-parallel ramp ADC. In Refs. 137 and 146, the conversion is requiring a third voltage level by using a two’s complement digital repre-
performed using a two-phase approach. The summed current on each sentation.68 Additionally, the binary resolution of the input reduces the
column, which represents the dot product, is first integrated onto a required ADC resolution, as we will quantify in Sec. VI A.12,68
capacitor. In the second phase, charge is steadily added onto the capac- The recently proposed successive integration and rescaling (SIR)
itor over an interval T. When the capacitor voltage crosses a threshold, scheme, shown in Fig. 11, performs the outer sum in Eq. (7) in the ana-
the output pulse is triggered, which falls at the end of this interval: the log domain, which enables input bit slicing without having to run the
greater the charge accumulated during the first phase, the longer the ADC on every iteration.146 In SIR, the column currents for the bth input
pulse duration in the second phase. This functionality is somewhat bit are first integrated onto a capacitor as an analog sum of charge; the

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-13


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

area, though its slow speed means that it must be separately provided
to every column.149 A more detailed comparison of the scaling proper-
ties of ADCs, not explicitly tied to crossbar applications, can be found
in Ref. 150.
In a ramp ADC, the analog signal is compared to a linearly
increasing reference voltage that is produced by a ramp generator
circuit.147 The transition in the comparator output is detected and
triggers the value of a digital counter to be written to a register,
which stores the digital output. The long ramp time, which scales
as 2B , makes this type of ADC too slow for some applications, such
as sampling a fast waveform. However, the ramp ADC is uniquely
suited to the massively parallel comparisons needed for the neuro-
morphic crossbar. The entire array can share a single ramp genera-
tor and digital a counter, and only a separate comparator and a
register need to be provided to the individual columns,62 as shown
in Fig. 12(a). Therefore, the latency scales as Oð2B Þ and does not
increase with the number of columns. The energy cost is domi-
nated by the comparators: if their power consumption is constant
FIG. 11. The successive integration and rescaling scheme for implementing shift- during the ramp time, then the total ADC energy consumption for
and-add operations in the analog domain. (a) After the currents are integrated on a VMM scales as OðNc 2B Þ.
capacitor CI, it is connected to capacitor CD, which halves the total accumulated
charge on CI. (b) Timing diagram of a 4-bit VMM. After integrating the final bit, the
A SAR ADC uses a B-bit DAC to generate the reference vol-
voltage output is encoded as a pulse length using the two-phase conversion tages to which the input is compared. An internal logic circuit exe-
scheme described in Sec. V B. Reproduced with permission from Bavandpour cutes a binary search algorithm to find the correct digital output
et al., IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 28, 823 (2020). Copyright within B comparisons.151 Due to its large area,152 a single SAR
2019 IEEE.146 ADC is typically shared by all the crossbar columns via time multi-
plexing,68,153 as shown in Fig. 12(b). The latency of this approach
thus scales as OðBNc Þ. The power consumption and area of a SAR
change in the capacitor voltage is proportional to the dot product car- ADC, which uses a capacitive DAC, scale exponentially with the
ried out on that input bit. After the integration is complete, a switch number of bits,154 though the area scaling might be more favorable
connects the capacitor in parallel with a second identical capacitor. The with a floating-gate DAC.150 The total energy consumption by the
charge redistribution then reduces the voltage on the integrating capaci- SAR ADC for a single VMM step therefore scales as OðBNc 2B Þ,
tor by a factor of two. The capacitors are then disconnected, and the though the constant prefactor may differ dramatically from that
rows are driven by the next input bit. In this way, the integrating capaci- for the ramp ADC. In general, the desired resolution and the num-
tors store the intermediate results of the bit-sliced VMM, and ideally, ber of columns will determine which of the two architectures is
the capacitance does not need to scale with the desired bits of precision. faster and/or more energy efficient.
By performing analog shift-and-add operations, the SIR technique uses
the ADC only once for all the input bits. Since the column outputs now
carry more precision, a higher-resolution ADC is required to maintain
the same precision achieved by a sequence of digital shift-and-add steps.

C. Analog-to-digital converters
Since the crossbar executes its computation on analog signals,
most accelerators require the conversion of the VMM output into a
digital form that can be transmitted to the next layer of the neural net-
work or to a host CPU. In accelerators that maintain modest to high
precision in the activation values, the ADC can easily dominate the
energy consumption, area, and latency of the accelerator. The ADC
architecture and its resolution B must therefore be chosen carefully to
preserve the intrinsic gains of using the memory crossbar for neuro-
morphic computations. We will discuss the optimal choice for ADC
resolution in Sec. VI A. Here, we review the two types of ADCs most
commonly used by crossbar accelerators: the ramp ADC, as used in
Ref. 62 and PRIME,136 and the SAR ADC, as used in ISAAC,68
PUMA,148 and the memristive Boltzmann machine.12 For small reso-
lutions, a high-speed flash ADC has also been used.119,139 The delta- FIG. 12. Analog-to-digital conversion of the crossbar VMM outputs, using (a) a par-
sigma ADC is another option for high precision in a relatively small allelized ramp ADC and (b) a time-shared SAR ADC.

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-14


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

D. The neuron function Reference 155 points out that implementing a transcendental
The neuron activation function f is necessary for both inference function such as the sigmoid or tanh in the digital domain with full
and training but does not benefit directly from the energy efficiency, software-equivalent accuracy may require a higher ADC resolution
density, or computational parallelism of the memory crossbar. It is than the number of output bits from the activation function. The issue
therefore realized separately using an analog or digital functional unit. can be seen in Fig. 13(a)—when a nonlinear function f is mapped onto
In analog, the neuron is ideally implemented within the functionality a discretized version of the analog outputs from the crossbar, there is
of the already existing peripheral circuits such as the ADC—otherwise, an inherent problem with undersampling in the steeper regions of f
the neuron circuit should be compact, low-energy, and fast. Likewise, and oversampling in the flatter regions. Thus, for the same ADC reso-
in the digital domain, the neuron logic should be sized so that it nei- lution, there may be a slight advantage in precision by implementing
ther takes up too much area nor limits the system’s throughput. these functions in the analog domain. However, the neural network
Digital implementations of nonlinear functions such as the sig- accuracy loss resulting from the over/undersampling issue has not
moid or hyperbolic tangent typically store the function values12 or a been quantified.
piecewise approximation of the function46 in a lookup table. The larger The neuron activation can be implemented using the analog
the lookup table, which is usually implemented in SRAM, the greater peripheral circuitry of the crossbar, with the notable caveat that these
its area and energy cost. For the ReLU function, typically used in methods are incompatible with input and synaptic bit slicing, since the
CNNs, the logic block can simply check the sign bit of the input to function f is applied on the full-precision result of the VMM. The
multiplex between the input value and zero.136,156 Since these nonlin- ReLU (and bounded ReLU) function can be integrated within an ADC
ear functions are applied identically to many input values, they can be simply by setting the appropriate upper and lower bounds on the
executed in parallel using single-instruction multiple-data (SIMD) vec- input range.118 A binary threshold activation function, as used in
tor instructions. Wider instructions, however, come at the cost of a binary neural networks, can be realized at no additional overhead
larger neuron functional unit. The PUMA architecture uses smaller using column comparators and a reference voltage.157,158
functional units to execute the equivalent wide instruction over several The sigmoid and tanh functions are more complicated to accu-
cycles: this compromise offsets the parallelism of SIMD but still rately realize in the analog domain but can be implemented using an
reduces the overhead of instruction fetch and decode.148 ADC with nonlinear quantization: equal changes in the output

FIG. 13. Mixed-signal integration of a nonlinear activation function with an ADC. (a) The function is first coarsely mapped onto a nonlinear ADC, and then, (b) a linear ADC
interpolates within the selected input voltage range of the nonlinear ADC. (c) Lookup table implementation of the nonlinear ADC, which yields the most significant bits of the
output, and (d) interval detection circuitry and linear ADC, which yield the least significant bits. Reproduced with permission from Giordano et al., IEEE J. Emerging Sel. Top.
Circuits Syst. 9, 367–376 (2019). Copyright 2019 IEEE.155

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-15


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

correspond to unequal spacing in the input voltages.159 To mitigate TPU, for instance, performs large-scale inference tasks using 8-bit
the over/undersampling issue above, Ref. 155 splits the ADC/neuron weights.22 It has been shown that image recognition is possible with
overhead across two circuits: the neuron function is first coarsely sam- even deeper quantization of the weights,27 down to a single bit,28,29 but
pled using a reconfigurable nonlinear ADC, and then, a piecewise- with a possible loss of several percent points in inference accuracy.
linear approximation of the neuron function is more finely sampled Thus, for general-purpose inference accelerators, at least eight bits of
using a linear ADC. Their scheme is shown in Fig. 13. The nonlinear weight precision would appear to be a good target for high accuracy.
ADC returns the most significant bits (MSBs) of the output, and the Reference 71 evaluates the resilience of a number of state-of-the-art
input range corresponding to these MSBs is used as the range for the CNNs to low-precision weights and noisy activations during inference.
linear ADC, which returns the least significant bits (LSBs). By comparison, how much precision can be expected in the pro-
Alternatively, dedicated analog circuits have been proposed, which grammed state of a non-volatile memory device? For state-of-the-art
approximately implement the sigmoid function using the transfer ReRAM, the resistance has been programmed to 0.5% accuracy in a
function of a low-gain comparator160 and a more compact six- 4  4 array,77 though the accuracy is lower when writing a 128  64
transistor circuit.161 array.129 Similarly, for floating-gate transistors, the state-of-the-art
programming accuracy for the drain current is about 1% over multiple
VI. ARCHITECTURES FOR INFERENCE
orders of magnitude.79,80 These results correspond to 7–8 bits of
Due to the highly demanding device and circuit requirements for weight precision and are obtained using a write-verify scheme: after
accurate neural network training, the acceleration of inference is a every write pulse, the conductance of the device is monitored, and this
nearer-term application for analog neuromorphic accelerators. In this is used to adjust the polarity of the next write pulse, and so on until
use case, the neural network is first trained externally (e.g., using pro- the correct weight value is obtained to the desired precision.64,77 To
cessors on the cloud), and the trained weights are subsequently loaded speed up programming, variable-amplitude pulses are often used to
onto edge devices to perform inference on new examples. For non- provide both coarse and fine tuning of the weight value.77,162 While
volatile memory based accelerators, this means programming all the the write-verify process is potentially energy-intensive and time-
devices once, possibly followed by occasional re-programming to consuming, its cost is amortized over the useful lifetime of the neural
update the network weights or to combat component failures and network or over the time between scheduled model updates, depend-
drift. In this section, we review the main architectural considerations ing on the application and the retention properties of the memory
that govern the design of an analog inference accelerator. elements.
Often, it is desirable in analog accelerators to reproduce the neu- Therefore, eight-bit weight precision remains at the upper limit
ral network inference accuracy that would have been achieved by an of what can be realistically achieved today using a single non-volatile
equivalent digital computation at a given level of data precision. This memory device. To enable full-precision computation (including
choice, which we will call the full precision assumption (and some- floating-point precision163) with limited-precision memory elements,
times also called “software-equivalent” accuracy), can be realized by many architectures employ synaptic bit slicing. Similar to bit slicing of
meeting three criteria: (1) the weights are reliably stored at full preci- the input activations, as discussed in Sec. V B, the Bw bits of a synaptic
sion by the memory elements, (2) the ADC resolution matches the res- weight can be segmented into Nw slices with B ~ W ¼ Bw =Nw bits each,
olution of a digital VMM, and (3) the voltage corresponding to the chosen so that an analog memory element can reliably store the bits
least significant bit lies above the noise floor. The third requirement within a slice. The devices corresponding to different bit slices of the
can be met by appropriately setting the ADC range and integrating the same weight are spatially partitioned onto different columns on the
column currents for long enough to obtain an acceptable signal-to-
same row of a single crossbar12,68,136,143,164 as shown in Fig. 14 or onto
noise ratio.63 The first requirement is met by providing devices with
different crossbars.153 Bit slicing allows less ideal memory devices to
sufficiently high programmable resolution or, if not available, by slic-
be used, reduces the required ADC resolution, and offers greater pro-
ing the bits of a weight value across multiple memory devices—we will
tection against the effects of noise and process variations. The number
discuss this technique in Sec. VI A. The ADC resolution will be consid-
of slices is chosen based on these considerations and based on the area
ered in Sec. VI B.
overhead, which is significant: different accelerators have implemented
Some architectures forego the full precision assumption to
a single weight (of varying precision) using two,136 eight,68 and 32
achieve better energy efficiency or smaller area. This option may be
devices.12 When combined with bit slicing of the inputs, this has been
preferable at low to moderate activation precision (e.g., eight bits or
called an “internally analog, externally digital” approach to
less) and a weight precision that is well matched to the effective pro-
VMM.119,143 The VMM equation is now expressed using two digital
grammable precision of the individual devices. The main advantage of
summations and one analog summation,
this approach is the elimination of the computational and data move-
!
ment overhead—in the form of energy, area, and latency—associated BX X
in 1 N w 1 X
N r 1
bþc ðcÞ ðbÞ
with synaptic and/or input bit slicing and the subsequent reconstruc- Yj ¼ 2 Gij Vi ; (9)
tion of the VMM result. b¼0 c¼0 i¼0

where b indexes the input bit and c indexes the weight bit slice. The
A. Synaptic bit slicing outermost sum corresponds to summing over the input bits, which are
A limitation of non-volatile memory devices is the number of streamed sequentially in time, and the middle sum composes the
distinguishable conductance levels that are available to represent a syn- results from weight bits sliced over multiple columns or crossbars.
aptic weight. For inference, the required amount of precision in the Typically, the weight bit slices are aggregated first, followed by the
weights is generally smaller than that needed for training. The Google input bit slices.

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-16


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are


~ w þ d log2 Nr e
Bin þ B ~w > 1
if Bin > 1; B
Bout ¼ ~ w þ d log2 Nr e  1 otherwise: (10)
Bin þ B
As we have seen, it can be costly in energy, latency, and area to
add a bit of precision to the ADC. To keep the needed ADC resolution
low, bit slicing of the inputs and of the weights can be used reduce the
operand widths Bin and B ~ w , respectively. While this reduces the VMM
overhead on a single bit slice, it does require repeating the same opera-
tion on all input and weight bit slices. At higher activation precision,
where the exponential scaling with resolution of the ADC overhead
dominates, this tends to be a favorable trade-off.
We note that where input or synaptic bit slicing is used, Eq. (10)
gives the ADC resolution needed to maintain full precision in the indi-
vidual bit-sliced VMMs. Not all these output bits may be needed in
reconstructing the final VMM result at the desired activation preci-
sion. After combining these intermediate VMM results via shift-and-
adds, several of the least and/or most significant bits of the result are
discarded. The Newton architecture determines which of the output
bits from the individual bit-sliced VMMs contribute only to those bits
that are ultimately discarded and then uses this knowledge to sepa-
rately reduce the ADC resolution on individual arrays. This is made
possible by an adaptive SAR ADC whose components can be gated off
when fewer comparisons are needed than the available bits in the
FIG. 14. Column-wise synaptic bit slicing. Here, each weight (represented as an 8-
DAC.153 This approach reduces the energy overhead of the ADC but
bit integer) is implemented by four 2-bit memory devices spread across four cross-
bar columns, and the results are aggregated using a shift-and-add reduction tree. not the area.
ADC resolution can also be reduced at a small energy and area
overhead by storing the weight values in a simple encoded manner on
In the column-wise bit slicing approaches, the bits for a given the crossbar. In ISAAC, a column of weights is stored in a “flipped”
weight are spread out over several columns grouped together in a form (W  ¼ 2Bw  1  W) if the vector dot product with the maxi-
stripe, as shown in Fig. 14. The sum over the bit slices is performed mum possible input x yields a 1 in the most significant bit—this can
using a shift-and-add reduction tree, which aligns the VMM output be determined statically and reduces the ADC resolution by one bit.68
bits collected from different columns and then computes the In the case that the weights are sliced over 1-bit devices, Ref. 163 uses
sum.12,68,164 If the synaptic bits are sliced across different crossbars, the computational inverse coding to further reduce the ADC resolution by
output of each array contains more partial sums but fewer bits per one bit: if a column of weights has more 1’s than 0’s, its inverse is
sum: the shift-and-add operations are applied to operands originating stored and an ‘inverted’ bit is used to decode the output. This method
from different arrays and can be integrated with the local communica- works as long as the corner case is handled in which a column con-
tion fabric between crossbars.153 In both cases, to aggregate the VMM tains exactly as many 1’s as 0’s (by, for example, using an odd number
results from the input bit slices streamed over time, further shift-and- of rows).
add operations are applied to these sums, as shown in Fig. 10(d). In architectures that do not assume full precision, the resolution
Some energy savings are possible by modifying the way that the and range of the ADCs must be carefully optimized to have the least
input and weight bits are sliced without requiring additional precision impact on neural network accuracy. In general, this optimization is
in the memory elements. Newton153 uses Karatsuba’s divide-and-con- data-dependent: as a basic example, dot products involving signed
quer algorithm to reduce the number of crossbar computations inputs and weights tend to have values that are clustered closer to zero
needed: if both the weights and inputs are split into two slices, for than those involving only positive values.63 The appropriate ADC res-
example, only three crossbar reads are needed with divide-and-con- olution and range for each crossbar can be calibrated offline based on
quer rather than four. This results in lower ADC usage but requires a training data.
third crossbar to implement an extra set of multiplication operands, When not computing at full precision, a further challenge arises
which imposes a large area overhead. The algorithm can be applied when composing the result of a large VMM whose MACs must be
recursively, with compounding overhead. In the Newton architecture, split across multiple crossbars. If a single crossbar has an insufficient
a single application resulted in 25% greater energy efficiency. number of rows, partial sum outputs from different crossbars should
be added before any nonlinear function is applied. If this addition
B. Reducing ADC overhead occurs after the digitization step, as is typically done, care must be
In full precision architectures, the required ADC resolution must taken that the crossbar outputs do not saturate the ADC: otherwise, a
account for the worst case: every input xi uses all Bin bits of input pre- nonlinearity may be introduced, which was not expected during train-
cision, and every weight Wij uses all B~ w bits of weight precision needed ing. Alternatively, to avoid the accuracy loss, the VMM splitting and
in a crossbar read. The maximum value of the dot product computed the resulting nonlinearity can be explicitly accounted for in the net-
on a column sets the required ADC resolution,68,119,136 work topology during training (or re-training).165

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-17


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

The overhead of the ADC circuitry can also be lowered without compatible with synaptic bit slicing: a separate pair of devices would
sacrificing the full precision assumption by using networks that are be used for every bit slice.136,143 For crossbars that utilize devices with
trained to use binary weights166 and/or binary neurons.157,167–169 This unipolar input–output characteristics, such as floating-gate transistors,
strategy effectively shifts the accuracy trade-off to the algorithm level four devices per weight may be needed to perform four-quadrant
and can be beneficial in some applications since binary neural networks multiplication.105,112,137
have been shown to reach inference accuracies that may be acceptably It is also possible to perform the subtraction in the digital
close to those obtained using floating-point weights, as described in domain, after digitizing the results from both crossbars. However, this
Sec. II C. Using binary activations greatly reduces the ADC overhead, approach requires twice the ADC usage. A further advantage of analog
and the use of binary weights reduces the number of memory devices subtraction is that it more efficiently exploits the fact that with real-
needed. Both schemes can improve tolerance to noise, low yield, and valued weights, the value of the dot product—and thus the voltage on
variability in the device conductance values.167 When splitting a large the integrator or TIA—will on average be closer to zero. For accelera-
VMM across multiple crossbars, several bits of ADC resolution are still tors that are not designed for full precision, this allows a significant
needed for the accumulation of partial sums prior to applying the reduction in the size of the integrating capacitor and the ADC range.62
binary activation function.157,168 While binary neural networks can A single-crossbar implementation of analog subtraction has also
bring substantial area and energy benefits to analog crossbar accelera- been proposed, which adds a column of reference bias resistors (or
tors, we note that digital accelerators and PIM architectures can also memristors) to the array, whose conductance Gb is effectively sub-
effectively exploit the reduced memory requirements and the greatly tracted from the conductance of every device using an analog
simplified MAC operations (which reduce to XNOR and bit counting inverter.172,173 While more area-efficient than a two-crossbar imple-
steps) that are involved in processing these networks.170 mentation, this approach is more susceptible to errors arising from
variability, drift, or offset in the shared reference resistors and the ana-
C. Signed computation log inverter.62
Real-valued weights have been implemented within a single
Though device conductances Gij are strictly positive, several
crossbar using two digital approaches. If the devices in the crossbar are
approaches exist to handle both positive and negative synaptic weights
binary, as in the memristive Boltzmann machine,12 then a real-valued
Wij that either require two crossbars per weight or are compatible with
weight can be stored in a digital two’s complement representation
a single crossbar per weight. In the most commonly used two-crossbar
across the 1-bit memory devices. If the input is also bit sliced to a sin-
implementation, a real-valued weight is realized using the difference of
gle bit, the resulting partial sums at the column outputs will be repre-
two conductances,62,64,128,171
sented in two’s complement format after digitization. In the case of
X
N r 1  inputs that are sliced to a single bit but multiple synaptic bits per
Ij ¼ Gþ 
ij  Gij Vi : (11) memory device, ISAAC68 stores the weights in a biased digital repre-
i¼0 sentation; after digitizing the column outputs, the bias is subtracted to
The current subtraction in the above equation can easily be imple- obtain the real-valued dot products. Similar to the analog bias column
mented in the analog domain by applying the same voltage input with scheme described above, this method requires an additional column
a different polarity to the two memory elements and summing the with uniform weights and suffers from the same variability issue,
currents using Kirchhoff’s law,62 as shown in Fig. 15. This approach is though this is less likely to be problematic if the array elements store
low-precision bit slices (two bits in ISAAC).

D. Hierarchical organization
Much like digital accelerators, analog neuromorphic accelerators
that store and process large neural network layers can benefit from
grouping the memory crossbars into clusters or tiles that can efficiently
share a set of resources. These shared resources might include local
memory to store activations (usually in the form of SRAM or embed-
ded DRAM buffers,12,68,148,153 but in a few cases, ReRAM buffers have
been used136,156,174) neuron circuitry, shift-and-add reduction units, a
control unit, and a local communication network. Tiles may be further
grouped to perform higher-level reductions or to accommodate the
mapping of large neural network layers. A hierarchy with two to four
tiers is used, for example, in RENO,139 the memristive Boltzmann
machine,12 PRIME,136 ISAAC,68 Newton,153 and PUMA.148 A few
examples are shown in Fig. 16. Hierarchical groups of crossbar proc-
essing elements can operate independently and concurrently by, for
example, processing different input batches, different layers, different
partitions of a large VMM, or different sliding windows of a
FIG. 15. A general scheme to represent positive and negative weights. When a
positive input pulse is sent to the left crossbar, a negative pulse of the same magni- convolution.
tude is sent to the right crossbar and vice versa, and a subtraction is performed by A neural network workload is typically mapped to hardware tiles
Kirchhoff’s law. with the goal of maximizing input data reuse and thereby reducing

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-18


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

FIG. 16. Organization of crossbar-based computational units in (a) the memristive Boltzmann machine, (b) PRIME, and (c) ISAAC. Reproduced with permission from Bojnordi
and Ipek, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA) (2016). Copyright 2016 IEEE; Chi et al., in ACM/IEEE 43rd
Annual International Symposium on Computer Architecture (ISCA) (2016). Copyright 2016 IEEE; and Shafiee et al., in Proceedings of the 43rd International Symposium on
Computer Architecture (ISCA ’16) (2016). Copyright 2016 IEEE.12,68,136

buffer sizes and data movement: this is considered, for example, in The 3D-aCortex accelerator, which does not use bit slicing, is a
PUMA and Newton.148,153 In bit-sliced architectures, further area and notable example of a non-hierarchical architecture that capitalizes on
energy savings are possible by efficiently consolidating a set of cross- the advantages of a 3D-NAND flash array.117 Each of the 64 2D planes
bars whose outputs are aggregated or reduced by a summation opera- of the array contains a 32  16 grid of floating-gate crossbars, each of
tion, such as the bit slice sum in Eq. (9). For example, Newton’s tiles which has dimensions of 64  128. At a given time step, a single plane
are composed of several in situ multiply accumulate units (IMAs), is selected and VMMs are executed on its crossbars, whose outputs are
each of which is a collection of crossbars interconnected by an HTree encoded as pulse lengths (described in Sec. V B). Partial outputs from
network with integrated shift-and-add circuitry, as shown in Fig. 17. multiple crossbars are temporally aggregated and digitized using digi-
Partial sums from crossbars dedicated to adjacent bit slices are com- tal counters, shared by all crossbars along a row of the grid, avoiding
bined at a shared shift-and-add unit at a leaf of the HTree, while the communication overhead of performing these reductions across
results from distant bit slices are aggregated closer to the center. Thus, multiple levels of a hierarchy. The entire 3D array shares a global
the partial sums computed by the IMA can be fully supported by a memory and a column of peripheral circuits, increasing its storage
shared communication bus that is made progressively narrower at the efficiency.
edges. While this structure constrains the IMA to map to a single
weight matrix, leading to reduced hardware utilization in some cases,
it enforces input sharing between crossbars and leads to a significant E. Convolutional neural network inference
reduction in the local bus width. These benefits translate to a reduced acceleration
IMA area and communication energy in comparison to more flexible Forward propagation in MLPs involves potentially large weight
systems that provide private input and output buses to individual matrices and input activations, but data reuse is significant only when
crossbars. performing inference in large batches. Their processing is therefore

FIG. 17. The in situ multiply accumulate


(IMA) unit in Newton, which constrains
mapping to reduce the area. A 16-bit
weight matrix is sliced across eight cross-
bars containing 2-bit memory devices, and
outputs of adjacent crossbars are com-
bined at a shared shift-and-add (S&A)
unit. This allows an HTree network within
the IMA that is narrower at the edges,
where the crossbars reside. This diagram
does not include the implementation of
Karatsuba’s algorithm. Adapted from Ref.
153.

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-19


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

limited by memory bandwidth.22 In this case, analog computation computation. Of the analog accelerators developed to date, the PUMA
inside memory crossbars has a large intrinsic advantage since it elimi- architecture is notable in its highly flexible and programmable design,
nates the movement of weight data. CNNs, on the other hand, are which provides the necessary circuit blocks to support MLP, CNN,
compute-bound due to the large number of sliding window operations LSTM, and various other workloads while balancing the trade-off
involved and the significantly larger ratio of computations to weights. between flexibility and its overhead on performance, area, and energy
They make less efficient use of the fact that the VMMs are performed efficiency. It also provides a high-level programming interface—in the
in memory, as digital architectures can capitalize on the extensive data form of an ISA and a compiler—between the neural network workload
reuse in these operations to amortize data access costs. For CNNs, the and its tiled, spatial microarchitecture.148
primary advantage of analog accelerators over their digital counter- Because of the large write energy and latency associated with
parts must come from the intrinsic parallelism of the crossbar VMM weight programming, analog inference accelerators cannot dynam-
operation. Since the ADC tends to dominate array energy costs, the ically remap different neural network layers onto its hardware dur-
energy efficiency of analog computation is best capitalized by convolu- ing runtime—a feature that is natively supported by digital
tional layers with larger filter sizes and/or a large number of input accelerators. Nonetheless, many analog architectures offer signifi-
channels, both of which increase the number of rows in the crossbar cant flexibility in the network topologies and layer sizes that can be
that share the same ADC. mapped onto them. Finding an area- and energy-efficient mapping
In terms of the number of computations, CNNs are heavily is a primary function of the compilers in PUMA148 and PRIME.136
front-loaded: many more results are needed from the first layer for In 3D-aCortex, a greedy optimization algorithm is used to map the
every result produced by a later layer. For an analog memory-based layers of a large neural network to the fewest number of 2D planes
accelerator, this can lead to long delays in the early layers and an in the 3D-NAND array.117
unbalanced utilization of the resources dedicated to each layer. To One consequence of performing computation in memory struc-
address this, ISAAC replicates the weights in the early layers over tures is that different workloads will tend to more or less efficiently uti-
many physical crossbars, which execute in parallel.68 By exploiting the lize the hardware resources in the crossbars. The trade-off between
fact that a convolutional layer only needs to partially finish its compu- efficiency and flexibility becomes particularly evident in the choice of
tation before the next layer can begin, the ISAAC pipeline increases the crossbar dimensions, which are usually homogeneous across the
throughput and minimizes the size of the shared tile memory. system: smaller crossbars can be more effectively tiled across weight
Maintaining balanced resource utilization requires enough weight rep- matrices of various sizes and maintain high area utilization, while
lication in the earlier layers to keep the crossbars in the final convolu- larger crossbars possess greater peak compute density and can provide
tional layers busy. Fully connected classification layers lying at the end greater amortization of the ADC and communication overheads
of a CNN have different demands than convolutional layers: since across operations. The trade-off also appears in designing the local
they compute only one VMM per example, they are used more interconnection fabric between crossbars. As discussed in Sec. VI D,
sparsely in time and do not require the same buffer sizes as convolu- Newton’s IMA uses a fixed reduction network to save area and energy,
tional layers. Newton (a successor to ISAAC) exploits this by allocating which is well suited to its style of bit slicing. Meanwhile, the memris-
different tile designs to convolutional and fully connected layers, lead- tive Boltzmann machine’s higher-level reduction tree, which aggre-
ing to a higher energy efficiency and throughput per area than ISAAC gates the partial sums of a large split VMM, is reconfigurable; on
but at some cost to workload flexibility.153 mapping the hardware to a workload, each node is set to the forward-
Similar to digital CNN accelerators, analog systems can save ing (F) or reduction (R) mode as shown in Fig. 16(a).12 This strategy
energy by reducing input and output data movement. To match the provides greater flexibility while still saving energy but not area.
data access patterns of sliding window computations, shift regis- PRIME is unique in that it provides “full-functional” subarrays, which
ters117,118 and line buffers168 are commonly used to pass data into a contain crossbars that can be operated both in computation (VMM)
convolutional layer. PUMA uses an input shuffle unit to dynamically mode and in standard memory mode, where they serve to enlarge the
re-route values in different registers of an input buffer to different buffers for activation data.136
crossbar DACs, exploiting the large input data reuse in adjacent sliding At the extreme end of reconfigurability, the Field Programmable
windows to reduce data movement in the buffers.148 Analog Array (FPAA) provides a flexible physical infrastructure—
While all the analog architectures for large-scale CNN accelera- with software support181—for integrating various analog
tion have been evaluated only in simulation, a small-scale analog CNN functionalities and for prototyping new architectures for analog
accelerator has been realized fully in hardware.180 The system uses computing.79,150 Floating-gate VMM crossbars and their periph-
eight 128  16 TaOx/HfOx ReRAM crossbars to implement two con- eral circuitry,105 along with other neuronal functional blocks such
volutional layers and one fully connected layer and achieves 96% accu- as winner-take-all circuits,113 have been implemented in FPAAs.
racy on MNIST. The Field Programmable Crossbar Array (FPCA) is a reconfigura-
ble architecture that maps several kernels—analog VMM with
F. Flexibility and reconfigurability binary neural networks, digital arithmetic operations, data transla-
tion, and data storage—onto its many binary ReRAM crossbars
The ability to reconfigure an inference accelerator for different
and even to virtual tiles within a crossbar.182
machine learning workloads is a useful functionality to the end user
but can introduce complexity to the instruction set, lead to additional
circuits or logic units for specialized functions, and increase the needed G. Performance comparison of inference accelerators
buffer sizes and bus widths when provisioned for the worst case153— Tables II and III compare the high-level design parameters and
all of which offset the area and energy efficiency of crossbar performance (measured and simulated) of several selected digital and

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-20


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

TABLE II. Comparison of selected digital and mixed-signal neural network inference accelerators from industry and research.(a) TOPS: Tera-Operations per second. We have counted MACs as single operations where possible. Note that performance (TOPS) is measured at the specified level of weight and activation precision, which differs between accelerators. The results for NVIDIA T4, TPU, Goya, UNPU, and Ref. 122 are measured; others are simulated. TOPS/mm² values are based on the die area, where provided.

Columns (left to right): NVIDIA T4 (Ref. 175); Google TPU v1 (Ref. 22)(b); Habana Goya HL-1000 (Ref. 176); DaDianNao (Ref. 44); UNPU (Ref. 51); Ref. 122 mixed-signal(c).

Process: 12 nm | 28 nm | 16 nm | 28 nm | 65 nm | 28 nm
Activation resolution: 8-bit int | 8-bit int | 16-bit int | 16-bit fixed-pt. | 16 bits | 1 bit
Weight resolution: 8-bit int | 8-bit int | 16-bit int | 16-bit fixed-pt. | 1 bit(d) | 1 bit
Clock speed: 2.6 GHz | 700 MHz | 2.1 GHz (CPU) | 606 MHz | 200 MHz | 10 MHz
Benchmarked workload: ResNet-50 (Ref. 177) (batch = 128) | Mean of six MLPs, LSTMs, CNNs | ResNet-50 (batch = 10) | Peak performance | Peak performance | Co-designed binary CNN (CIFAR-10)
Throughput (TOPS): 22.2, 130 (peak) | 21.4, 92 (peak) | 63.1 | 5.58 | 7.37 | 0.478
Density (TOPS/mm²): 0.04, 0.24 (peak) | 0.06, 0.28 (peak) | … | 0.08 | 0.46 | 0.10
Efficiency (TOPS/W): 0.32 | 2.3 (peak) | 0.61 | 0.35 | 50.6 | 532

(a) To enable performance comparisons across a uniform application space, we did not consider accelerators for spiking neural networks.
(b) The TPU v2 and v3 chips, which use 16-bit floating point arithmetic, are commercially available for both inference and training on the cloud. MLPerf inference benchmarking results for the Cloud TPU v3 are available,179 but power and area information is undisclosed. The TPU v1 die area is taken to be the stated upper bound of 331 mm²; the listed TOPS/mm² values are therefore a lower bound.
(c) The mixed-signal accelerator in Ref. 122 performs multiplication using digital logic and summation using analog switched-capacitor circuits.
(d) The UNPU architecture flexibly supports any weight precision from 1 to 16 bits. The results are listed for 1-bit weights.

TABLE III. Comparison of selected analog neural network inference accelerators. Note that performance (TOPS) is measured at the specified level of weight and activation precision, which differs between accelerators.

Columns (left to right): ISAAC (Ref. 68); Newton (Ref. 153); PUMA (Ref. 148); PRIME (Ref. 136); Memristive Boltzmann machine (Ref. 12); 3D-aCortex (Ref. 117).

Process: 32 nm | 32 nm | 32 nm | 65 nm | 22 nm | 55 nm
Activation resolution: 16 bits, 1-bit sliced | 16 bits, 1-bit sliced | 16 bits, 1-bit sliced | 6 bits, 3-bit sliced(a) | 32 bits, 1-bit sliced | 4 bits, temporal
Weight resolution: 16 bits | 16 bits | 16 bits | 8 bits | 32 bits | 4 bits
Weight storage: ReRAM (8 × 2-bit) | ReRAM (8 × 2-bit) | ReRAM (8 × 2-bit) | ReRAM (2 × 4-bit) | ReRAM (32 × 1-bit) | NAND flash (1 × 4-bit)
R_high: 2 MΩ (Ref. 65) | 2 MΩ (Ref. 65) | 1 MΩ | 20 kΩ | 1.1 GΩ | …
R_low: 2 kΩ (Ref. 65) | 2 kΩ (Ref. 65) | 100 kΩ | 1 kΩ | 315 kΩ | 2.3 MΩ
Array size: 128 × 128 | 128 × 128 | 128 × 128 | 256 × 256 | 512 × 512 | 64 × 128
ADC type: SAR (8-bit) | SAR (8-bit) | SAR | Ramp (6-bit) | SAR | Temporal to digital (4-bit)
Clock speed: 1.2 GHz | 1.2 GHz | 1.0 GHz | 3.0 GHz (host CPU) | 3.2 GHz (host CPU) | 1.0 GHz
Benchmarked workload: Peak performance | Peak performance | Peak performance | … | … | GNMT (Ref. 178)
Throughput (TOPS): 41.3(b) | … | 26.2 | … | … | 10.7
Density (TOPS/mm²): 0.48 | 0.68 | 0.29 | … | … | 0.58
Efficiency (TOPS/W): 0.63 | 0.92 | 0.42 | … | … | 70.4

(a) PRIME splits the input into two 3-bit slices and uses a 3-bit DAC to convert each digital slice to one of the eight voltage levels.
(b) The ISAAC-CE design point from Ref. 68 is used.

analog inference accelerators, respectively. Of those accelerators listed, the NVIDIA T4, Google TPU, and Goya are commercially available on premise or in the cloud—the remainder are research systems. For many of the other commercial digital systems, the reported performance values are aggregated in Ref. 4. In addition, the MLPerf inference benchmarking project provides an apples-to-apples performance comparison of these accelerators for several benchmark deep learning tasks.179 With the exception of Ref. 122, all the architectures for which performance numbers are given can support arbitrary CNN and MLP workloads; many can also natively support LSTMs. For the

analog accelerators, performance details on specific benchmark models can be found in their respective papers.

The projected performance of the analog accelerators in Table III approximately reaches parity with the measured performance, compute density, and energy efficiency of the best digital accelerators when adjusted for different arithmetic resolutions. Further improvements are still likely needed before analog accelerators can become widely adopted for commercial use. A significant bottleneck for many of the analog architectures is the ADC, which consumes 49% of the total chip power in ISAAC and 41% in the memristive Boltzmann machine. Newton reduces the ADC cost by as much as 60% over ISAAC via several previously mentioned techniques, such as adaptive ADC resolution and divide-and-conquer. Adding more rows to the crossbar can allow greater amortization of this cost across operations but would require higher ADC resolution in full-precision architectures.

VII. ARCHITECTURES FOR TRAINING

Training large neural networks within an analog neuromorphic accelerator is considered a longer-term goal for the field due to the challenge of meeting the additional device requirements enumerated in Sec. IV A—update linearity, update symmetry, high precision, low write energy and latency, and high endurance. Most of the considerations discussed in Sec. VI for inference accelerators also apply to training accelerators, which must support both forward and backward propagation. In this section, we will review the special architectural considerations for training accelerators and the inherent parallelism provided by the memory crossbar for weight update operations.

Despite the stringent device requirements, successful training of memory crossbars has been demonstrated experimentally for smaller tasks using less demanding training schemes. Most of these results use an in situ training approach that represents an intermediate step toward a true training accelerator: forward propagation is performed on the crossbar, the weight updates ΔW are calculated on an off-chip processor, and the updates are programmed onto the crossbar.183 This approach, used in many demonstrations,127,128,130,134,171,184 is suitable as a platform to test the learning capability of the memory devices. To simplify the hardware needed for programming, the Manhattan update rule—in which the weight updates are binary (positive or negative)—has commonly been used.127,130 While this update rule reduces hardware complexity and significantly speeds up programming, Ref. 132 shows that it incurs several percentage points of accuracy loss (shown on a simple face recognition task) compared to analog updates using write-verify schemes.
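To make the distinction concrete, the sketch below (Python/NumPy, not taken from the original references) contrasts a Manhattan-rule update, which applies a fixed-size conductance step based only on the sign of the computed update, with an analog update that attempts to program the full-precision value; the step size and the simple write-noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def manhattan_update(W, dW, step=0.01):
    # Apply a fixed-magnitude pulse whose polarity follows the sign of the
    # desired update; the magnitude information in dW is discarded.
    return W + step * np.sign(dW)

def analog_update(W, dW, write_noise=0.002):
    # Program the full-precision update; a write-verify loop is modeled here
    # as a small residual programming error on each device.
    return W + dW + write_noise * rng.standard_normal(W.shape)

W = rng.normal(0, 0.1, size=(4, 4))
dW = rng.normal(0, 0.01, size=(4, 4))    # desired gradient-based update
print(manhattan_update(W, dW) - W)        # every entry is exactly +/- step
print(analog_update(W, dW) - W)           # close to dW, up to write noise
```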
Recently, Ref. 131 integrated a memristor crossbar with CMOS circuitry that computes the weight updates and applies the appropriate electrical pulses for programming. The system successfully trained a single-layer perceptron using batch gradient descent. It also trained a small two-layer neural network using Sanger's rule for principal component analysis (PCA),185 though unlike backpropagation, this update rule does not require the propagation of data between layers.

A. Supporting backpropagation

To provide a significant acceleration of the backpropagation algorithm, both the computation of the layer-by-layer errors δ and the applications of the weight updates ΔW should take advantage of the intrinsic parallelism provided by the crossbar. The former process is relatively simple to implement on an accelerator that has already been designed for inference. Since backpropagation by MVMs is the transpose of forward propagation by VMMs, it can be parallelized in the same manner so long as the architecture supports both the forward and backward flow of data. With additional routing around the crossbars, the same peripheral circuitry (such as DACs, integrators, TIAs, and ADCs) can be reused for an MVM computation.186 In Ref. 62, for example, MVMs are supported with minimal overhead using additional logic to re-route the inputs to the columns and an analog multiplexer to allow the integration of currents on the rows. This reconfigurable crossbar-based neural core is shown in Fig. 18.
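The following NumPy sketch illustrates why no second copy of the weights is needed: the same conductance matrix serves the forward VMM and the backward MVM, which is simply the product with the transpose. The array dimensions and values are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.uniform(0.0, 1.0, size=(128, 64))   # crossbar conductances (weights)

x = rng.uniform(0.0, 1.0, size=128)          # activations driven onto the rows
delta = rng.normal(0.0, 0.1, size=64)        # errors driven onto the columns

y_forward = x @ G        # VMM: currents summed along each column
err_back = G @ delta     # MVM: currents summed along each row

# The backward pass through the layer is the product with the transpose of G,
# which the physical array provides by swapping the driven and sensed edges.
assert np.allclose(err_back, delta @ G.T)
```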
The backpropagation step also requires the application of the derivative of the neuron activation function f′ in Eq. (2). For functions such as the sigmoid or hyperbolic tangent, this can be provided by a separate digital lookup table as described in Sec. V D. In an alternative approach, Ref. 160 computes the sigmoid derivative in the analog domain by applying analog arithmetic (with operational amplifiers) to

FIG. 18. Reconfigurable neural core in Ref. 62 for implementing (a) VMM, (b) MVM, and (c) outer product update. The inputs to the VMM and MVM are coded as pulse trains
[see Fig. 10(d)] using the temporal coding logic. The DAC function of the voltage coding block is bypassed during a MVM but is used during an outer product update.
Reproduced with permission from Marinella et al., IEEE J. Emerging Sel. Top. Circuits Syst. 8, 86–101 (2018). Copyright 2018 IEEE.62

the output of the sigmoid unit. The derivative of the ReLU function is typically much simpler to implement since it is a binary function that depends only on the sign of the argument. The calculation of f′ can be pipelined with the VMMs during forward propagation, and the MVMs during backpropagation can be pipelined with the weight updates, as done in TIME156 and PipeLayer.174

For the derivative calculation step and the weight update step, it is necessary to have on hand the values of any given layer's activation x and the computed values of f′. An important distinction from an inference-only accelerator is that these intermediate results generated during forward propagation must not be overwritten until the errors have been propagated back to that layer.186 This introduces a substantial additional requirement on the size of the local memory accessible to each crossbar, which must fully store the values of x and f′ to minimize data movement during backpropagation.

B. Parallel outer product update

By applying programming pulses simultaneously to all the rows and columns of a crossbar, the outer product weight update in Eq. (3) can be executed in parallel for the entire array. To accomplish this, the activations x are applied to one edge of the crossbar and the errors δ to the other edge in such a way that a multiplication effect between the signals is seen at the crosspoints. In Ref. 62, this is done by encoding x_i as a temporal signal (pulse length or pulse train) and δ_j as a pulse amplitude using temporal coding blocks and voltage coding blocks (i.e., DACs), respectively, as shown in Fig. 18(c). The parallel outer product update is depicted in Fig. 19. The programming is performed in four phases to account for the possible sign combinations of x and δ without causing write disturb. The learning rate η can be controlled by scaling the pulse lengths or the number of pulses (or pulse trains) fired.63 Reference 160 implements a similar temporal-voltage encoding scheme but applies the time encoded signal to access transistors along a column rather than directly across the devices. The pulse length encoding is done using a circuit that takes an analog rather than digital input for δ.

FIG. 19. Parallel outer product update of a crossbar array, shown using temporal (pulse length or pulse train) coding for the activations and voltage amplitude coding for the errors.
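A minimal functional sketch of this parallel update is given below (Python/NumPy). It models the hybrid encoding of Ref. 62 only at the behavioral level: each activation is quantized to a pulse count and each error to a pulse amplitude, so that the conductance change at every crosspoint is proportional to their product. The quantization levels and learning rate are illustrative assumptions.

```python
import numpy as np

def parallel_outer_product_update(G, x, delta, lr=0.1,
                                  time_levels=32, amp_levels=32):
    """Rank-1 update dG ~ lr * outer(x, delta), applied in one parallel step.

    x is encoded as a pulse count (temporal coding on the rows) and delta as
    a pulse amplitude (voltage coding on the columns); the device at (i, j)
    integrates a charge proportional to their product.
    """
    x_max = np.max(np.abs(x)) + 1e-12
    d_max = np.max(np.abs(delta)) + 1e-12
    pulses = np.round(np.abs(x) / x_max * time_levels)    # pulse counts >= 0
    amps = np.round(np.abs(delta) / d_max * amp_levels)   # amplitudes >= 0
    # Four phases cover the sign combinations (+,+), (+,-), (-,+), (-,-).
    signs = np.outer(np.sign(x), np.sign(delta))
    dG = lr * (x_max / time_levels) * (d_max / amp_levels) \
         * signs * np.outer(pulses, amps)
    return G + dG

rng = np.random.default_rng(2)
G = rng.uniform(0, 1, size=(8, 4))
x, delta = rng.normal(size=8), rng.normal(size=4)
G_new = parallel_outer_product_update(G, x, delta)
# Up to quantization, the update matches the ideal rank-1 outer product.
print(np.max(np.abs((G_new - G) - 0.1 * np.outer(x, delta))))
```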
Alternatively, the multiplication effect can be achieved at a constant write voltage or power by encoding one variable in the length or duty cycle and the other variable in the repetition rate of two overlapping pulse trains.187,188 However, as with pulse length encoding for VMMs, encoding the two vectors fully in the time domain leads to a stronger exponential penalty in the latency by the desired bits of precision in ΔW. In the hybrid temporal-voltage coding scheme, this penalty can be efficiently shared between the write latency and the area/energy overhead of the DAC so that neither is very large.

The parallel outer product improves the weight update latency by O(N) compared to a serial row-by-row programming scheme. Another advantage is that only the vectors x and δ need to be stored rather than the full weight update matrix ΔW. One drawback is the potentially large instantaneous power that might be drawn by programming all (or one quarter) of the memory elements simultaneously. To reduce the power consumption, the parallel update can be partitioned into blockwise updates with sufficiently large blocks to avoid breaking the advantages of parallelism.134 The amount of power consumed during the weight update is data- and task-dependent. In some applications, the fully parallel update consumes only marginally more instantaneous power than a row-by-row update.187

C. Training with batch size > 1

Although the parallel outer product update operation carries significant benefits to latency, energy consumption, and storage overhead, it is inherently incompatible with batch sizes greater than one, whose update can no longer be expressed as a rank-1 matrix as in Eq. (3). Being limited to stochastic gradient descent learning can be problematic as it provides a less accurate estimate of the true gradient.17 It also precludes vectorization or pipelining of forward and backward propagation, thus increasing the training time and under-utilizing the hardware. Finally, it imposes greater endurance requirements on the devices, which need to be programmed after every training example.

Batch training can be realized efficiently, without relying on frequent serial updates, by allocating two crossbars per weight matrix.63 Forward propagation is performed using the first crossbar, and parallel outer product updates are applied to the second crossbar, which is initialized as a zero weight matrix. At the end of a batch, the total accumulated weight update is read out from the second crossbar and serially written to the first. Since a slow serial programming step is needed only during the weight transfer process, its cost is amortized over all the examples in a batch.
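The two-crossbar scheme can be sketched as follows (Python/NumPy); the class name and the per-example rank-1 accumulation are illustrative and are not taken verbatim from Ref. 63.

```python
import numpy as np

class TwoCrossbarLayer:
    """Array W is used for inference; a second array dW_acc accumulates
    parallel rank-1 updates over a batch and is folded into W afterwards."""

    def __init__(self, n_in, n_out, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.1, size=(n_in, n_out))
        self.dW_acc = np.zeros_like(self.W)   # second crossbar, starts at zero
        self.lr = lr

    def forward(self, x):
        return x @ self.W                     # VMM on the first crossbar

    def accumulate(self, x, delta):
        # Parallel outer product applied to the second crossbar each example.
        self.dW_acc -= self.lr * np.outer(x, delta)

    def end_of_batch(self):
        # Slow serial read/write happens only once per batch.
        self.W += self.dW_acc
        self.dW_acc[:] = 0.0

layer = TwoCrossbarLayer(4, 3)
for _ in range(8):                            # one batch of 8 examples
    x, d = np.random.rand(4), np.random.rand(3)
    _ = layer.forward(x)
    layer.accumulate(x, d)
layer.end_of_batch()
```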
The PipeLayer accelerator takes the above approach of using a second ReRAM array as a buffer to store intermediate weight updates in a batch.174 Furthermore, it physically duplicates each weight matrix so that one is used for forward propagation and the other for backpropagation. In this way, the two processes can be pipelined within a batch without weight conflicts. At the end of the batch, the accumulated weight updates are copied from the buffer array to all crossbars containing the copies of the weight matrix. The dataflow for this pipeline is shown in Fig. 20(a) for a single training example and in Fig. 20(b) for multiple examples. The

FIG. 20. The PipeLayer architecture. (a) Dataflow of a single example through a 3-layer neural network for training. Each time step may correspond to several computational cycles. The computational blocks active in each time step are labeled: forward propagation (blue), backpropagation (red), and weight update (red). (b) Pipeline for multiple training examples. The weight updates ΔW are serially written to the W and Wᵀ crossbars at the end of a batch. Adapted from Ref. 174.

intermediate activations x are also stored in dedicated local ReRAM buffers. The forward and backward pipelining ensures that the storage requirement for these activations is determined by the layer's depth within the network rather than by the batch size. Since these activations are accessed in a first-in first-out order, PipeLayer operates these ReRAM arrays as circular buffers.

Another approach to speed up batch training and to reduce the endurance requirement on the memory devices is to find a good rank-1 approximation to the high-rank batch update. The best such approximation can be constructed from the largest singular value of the batch update matrix and its left and right singular unit vectors. Ref. 189 uses a method based on streaming PCA to refine an estimation of these quantities as the values of x and δ are found for each example in a batch. This can be done with a storage overhead that scales as O(N) with the crossbar dimension. Accuracy on MNIST was shown to converge to similar levels as both stochastic and minibatch gradient descent but with about 10× fewer matrix updates than SGD for a batch size of 32.
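As a point of reference, the best rank-1 approximation of a batch update can be computed directly with an SVD, as in the NumPy sketch below. Ref. 189 instead estimates the leading singular vectors with a streaming PCA so that the full batch update matrix never has to be formed; this sketch only illustrates the underlying idea.

```python
import numpy as np

rng = np.random.default_rng(3)
batch = [(rng.normal(size=16), rng.normal(size=8)) for _ in range(32)]

# Full (high-rank) batch update: sum of per-example outer products.
dW_batch = sum(np.outer(x, d) for x, d in batch)

# Best rank-1 approximation from the leading singular triplet.
U, S, Vt = np.linalg.svd(dW_batch, full_matrices=False)
dW_rank1 = S[0] * np.outer(U[:, 0], Vt[0])

rel_err = np.linalg.norm(dW_batch - dW_rank1) / np.linalg.norm(dW_batch)
print(f"relative error of rank-1 batch update: {rel_err:.2f}")
# The rank-1 update can then be applied to the array in a single parallel
# outer product step instead of 32 separate per-example updates.
```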
D. Training convolutional neural networks

Training introduces additional complexities to the acceleration of CNNs. In ISAAC, which is an inference accelerator, high throughput is achieved by providing sufficient replication of the weights in the earlier layers to keep the later layers maximally busy.68 Many cycles may nonetheless be needed to fill its pipeline. If the same hardware is used for training, the benefit of high throughput is eliminated if the deep pipeline has to be cleared at the end of every batch. The second benefit of ISAAC's pipeline, a reduced buffer size, is also eliminated since the full activation must be retained in memory for backpropagation. PipeLayer further increases the amount of weight replication in each layer to reduce the latency of a single example during training. The crucial trade-off is between the area and training throughput since requiring that early layers be computed in a single cycle results in a prohibitively large number of replicated crossbars.174

E. Compensation for device imprecision and asymmetry

Owing to the small parameter adjustments mandated by gradient descent, neural network training requires a higher level of precision in the weight representation than needed during inference. Since individual analog memory devices are limited to at most 7–8 bits of precision, some form of bit slicing is typically used.96,174 For example, PipeLayer uses 4-bit ReRAM cells. During weight update, the full values of the original 16-bit weights are read out from the crossbars, the update is computed digitally, and the new weights are written back to all the devices.174 This process incurs a substantial time and energy penalty due to the need for serial readout and programming with every update.

Bit slicing is, without modification, not fully compatible with a parallel outer product update. The periodic carry technique, shown in Fig. 21(a), mitigates this problem by allocating a portion of the conductance range of a device to store one or more carry bits.190 Taking advantage of the fact that most weight changes are small, parallel outer product updates are carried out on the LSB devices only. Periodically, the LSB device is read, and any carry bits are propagated to the device, which stores the next higher bits. Thus, the cost of serial readout and programming can be amortized over many updates. For some devices, this technique potentially curbs the effects of device nonlinearity: as more conductance levels are dedicated to the carry while performing relatively frequent carry propagation, each device can be more effectively constrained to the most linear part of its conductance range.
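A behavioral sketch of periodic carry is shown below (Python); the digit base, carry interval, and two-device weight representation are illustrative assumptions rather than the specific design of Ref. 190.

```python
import numpy as np

class PeriodicCarryWeight:
    """Weight value = msb * BASE + lsb. Frequent (parallel) updates touch only
    the LSB device; carries are propagated to the MSB device only occasionally."""
    BASE = 16   # one MSB step equals 16 LSB steps

    def __init__(self):
        self.msb, self.lsb = 0.0, 0.0

    def update(self, dw_lsb):
        # Fast path: outer-product updates are applied to the LSB device only.
        self.lsb += dw_lsb

    def propagate_carry(self):
        # Slow path, run periodically: read the LSB device, move whole
        # multiples of BASE into the MSB device, and rewrite the remainder,
        # keeping the LSB device inside its (most linear) operating range.
        carry = np.trunc(self.lsb / self.BASE)
        self.msb += carry
        self.lsb -= carry * self.BASE

    @property
    def value(self):
        return self.msb * self.BASE + self.lsb

w = PeriodicCarryWeight()
for step in range(1, 101):
    w.update(0.7)                  # many small updates accumulate on the LSB
    if step % 10 == 0:
        w.propagate_carry()        # occasional serial carry to the MSB
print(w.value, w.msb, w.lsb)       # total is 70, split across the two devices
```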
Reference 141 performs a different form of bit slicing with two different types of devices: a PCM device holds the MSBs, and a capacitor connected to a read transistor is used as a memory element for the LSBs. The scheme is shown in Fig. 21(b). Charge is added or removed from the capacitor by a pair of CMOS current sources; the high linearity of these updates is used to overcome the issues of nonlinearity, endurance, and strong asymmetry in programming the PCM device. The volatile state of the capacitor is periodically written to the non-volatile PCM device as in periodic carry. The cost of serial readout and

FIG. 21. (a) The periodic carry technique.190 A portion of the device dynamic range is used to store a carry (overflow), which is periodically propagated to the higher bits. Parallel outer product update is carried out only on the device holding the LSBs. (b) Reference 141 uses a PCM device to hold the MSBs of the weight and a capacitor with charging/discharging circuitry to hold the LSBs; both are paired with a second device to implement signed weights and to overcome update asymmetry. For the capacitive memory, the second device is shared by a row, effectively subtracting a bias. Reproduced with permission from Ambrogio et al., Nature 558, 60 (2018). Copyright 2018 Springer Nature.141

programming is thus spread out over many updates, but the timescale of charge leakage from the capacitor must be considered. A disadvantage of this scheme is the larger size of each synaptic unit cell.

Reference 191 combats the effects of weight imprecision by computing and accumulating the weight updates over a batch in high precision in a digital co-processor. At the end of a batch, these updates are quantized to a low precision that is compatible with the programmable resolution of the PCM synaptic devices, and the residual between the high-precision (floating-point) and low-precision updates is kept in the digital co-processor as an initial value in the accumulation of the next weight update. By applying updates only once per batch to the memory elements, with the more granular per-example update computations done in the digital domain, the effects of device nonlinearity and stochasticity can be alleviated. However, in offloading a large fraction of the computational burden to a separate digital co-processor, the highly energy-efficient analog computation contributes a diminished share of the total training energy cost.

Weight update asymmetry, which is significant in PCM and some ReRAM devices, is known to considerably degrade the accuracy of neural networks.87,192 When combined with nonlinearity, asymmetry causes weights to decay toward zero after a succession of positive and negative updates are applied.86 One solution to asymmetry is to apply updates of only a single polarity to a pair of devices: one device to increase the weight and another to decrease the weight. This can be done without additional area overhead since the same device pair that is used to implement real-valued weights (Fig. 15) can be used for this purpose. Occasionally, after enough updates, one or both devices may saturate or enter a highly nonlinear regime; thus, the device conductances must be periodically read, reset back to an initial conductance state, and then restored to represent the correct weight.134,193 This cost is amortized over many updates and may also be finely calibrated given prior knowledge about the average number of analog states in the asymmetric device.194 Writing to only one device per update can also increase device resilience and decrease the number of refresh steps needed.195 Asymmetry can also be partially compensated at the algorithm level by using different learning rates for updates of different signs196 and by using adaptive neuron-specific learning rates.197

The "Tiki-Taka" method198 mitigates the effect of asymmetric nonlinearities by splitting the weight matrix into a sum of two matrices: W = A + γC, each of which is mapped onto two crossbars as in Fig. 15. The crossbars for A are first calibrated to an operating point where A = 0, and the devices have a symmetric update response. This is done by applying a sequence of alternating positive and negative pulses to every device until a steady-state conductance is reached. During training, parallel outer product updates are applied to A, and the weights in A are periodically transferred to C. Since the destructive effects of an asymmetric nonlinearity are usually accumulated over many updates, they can be largely eliminated by performing most of the updates accurately near the symmetric operating point (in A) while only sparsely updating the devices (in C) that operate in a more asymmetric regime.
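A simplified behavioral sketch of this two-matrix scheme is given below (Python/NumPy); the transfer interval, the scalar γ, and the transfer rule are illustrative assumptions and not the exact procedure of Ref. 198.

```python
import numpy as np

class TikiTakaWeights:
    """Effective weight W = A + gamma * C. Fast rank-1 updates are applied to A,
    operated near its symmetric point; A is periodically transferred to C."""

    def __init__(self, shape, gamma=0.5, transfer_every=10, seed=0):
        self.A = np.zeros(shape)          # calibrated so that A = 0 initially
        self.C = np.random.default_rng(seed).normal(0.0, 0.1, shape)
        self.gamma, self.transfer_every = gamma, transfer_every
        self.step = 0

    @property
    def W(self):
        return self.A + self.gamma * self.C

    def update(self, x, delta, lr=0.01):
        # Parallel outer product applied to A on every training example.
        self.A -= lr * np.outer(x, delta)
        self.step += 1
        if self.step % self.transfer_every == 0:
            self._transfer()

    def _transfer(self):
        # Occasional, slower step: fold the accumulated update into C and
        # return A to its symmetric operating point (W is left unchanged).
        self.C += self.A / self.gamma
        self.A[:] = 0.0

w = TikiTakaWeights((8, 4))
rng = np.random.default_rng(1)
for _ in range(50):
    w.update(rng.normal(size=8), rng.normal(size=4))
print(np.linalg.norm(w.W))
```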
VIII. MITIGATING ARRAY- AND DEVICE-LEVEL NON-IDEALITIES

The accelerators that we have considered face unique challenges compared to their digital counterparts, in part due to the analog nature of their computation and in part originating from their use of a dense array of memory elements. In spite of the intrinsic error resilience of neural networks, which are designed to generalize well to diverse and noisy real-world data, we have seen that the devices that make up these neuromorphic accelerators must meet a demanding set of requirements. This is made more challenging by the fact that most of the non-volatile memory devices that are well suited for this application are still emerging technologies. In this section, we briefly survey some techniques—in addition to those previously discussed—mainly at the architecture and algorithm level to mitigate the limitations of memory arrays and the individual memory devices on neural network performance.

A. Parasitic resistance

As discussed in Sec. IV C, line resistances along the rows and columns impose a limit on the maximum feasible size of a crossbar. This

is an important limitation, as it forces a large weight matrix to be partitioned across multiple crossbars and increases the area and energy overhead of the peripheral circuits relative to that of the more efficient crossbars. The penalty for parasitic resistance, which reduces the fidelity of array reads and writes, comes mainly at the expense of neural network accuracy. At the circuit level, this issue can be partially addressed by using a 1T1R array in which the VMM inputs are applied to the gate of an access transistor as a bit-sliced or temporally encoded signal. Since the gate terminal has a large resistance, the effects of parasitic voltage drops can be minimized on the communication of the input activations but not of the partial sums.

At the technology level, large arrays can be enabled by increasing the interconnect width, which reduces the line resistance, or by decreasing the conductance of the memory device, which reduces the voltage drops. The latter can be achieved using floating-gate synapses in subthreshold mode, which have conductances on the order of 1 to 10 nS105 but require a relatively large voltage to program. Another pathway is to engineer emerging memories with low conductance. ReRAM devices with a conductance as small as 10 nS (and less than 10 μA of write current) have been demonstrated by inserting an oxide tunneling barrier into the material stack, but they also possess greater susceptibility to random telegraph noise.199 Conductances smaller than 100 nS have also been demonstrated using redox transistors with a highly diluted conductive polymer.72

Several methods have been proposed to mitigate the neural network accuracy loss that arises from voltage drops across the array—this degradation effect is quantified in Ref. 200. Reference 65 calibrates for the predictable effects of the voltage drops using a nonlinear mapping of the weight matrix to the memory array, found via circuit simulations. Reference 201 adds series resistances to the periphery of the array to equalize the parasitic voltage drops seen by all parts of the crossbar. Reference 200 uses a lumped circuit model of the crossbar to show that the ideal crossbar currents (without parasitics) can be approximately recovered from the real current using an analytical expression, provided that the line resistances can be estimated or measured. Reference 202 shows that a neural network can learn around the parasitic voltage drops by modeling their effect as an injection of Gaussian noise on the VMM results during training. The mean and variance of this noise source are extracted from circuit simulations of an uncompensated crossbar that implements the desired neural network.

B. Handling sparse neural networks

Weight pruning is often used to make neural networks more compact and reduce their computational overhead. Because removing a large fraction of the neuron connections results in a relatively sparse and irregular weight matrix, the resulting network does not map in an area-efficient way onto a memory crossbar—this can be considered a limitation of the crossbar's rigid connectivity pattern. Low utilization of this connectivity implies that the energy consumption of the peripheral circuits is amortized over fewer MAC operations. As one solution, the neural network can be adapted during the pruning process to an eventual crossbar implementation by clustering the nonzero elements of a sparse matrix into blocks that can then be mapped onto multiple smaller crossbars; the zero weights no longer need to lie at the array crosspoints, improving the energy efficiency and potentially the area. Reference 203 proposes an iterative training scheme with a clustered pruning step to achieve this type of matrix compression. Reference 204 obtains a similar outcome, but starting with a sparse neural network, by re-arranging the matrix columns using k-means clustering, pruning elements that still remain outside of fixed-size blocks, and re-training the network. They show that at least on relatively small ReRAM crossbars (such as 32 × 32), their algorithm can largely retain the baseline accuracy of several large-scale neural networks and achieve a several-fold reduction in energy.
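The clustering idea can be illustrated with the short sketch below (Python/NumPy), which groups the columns of a pruned weight matrix by the similarity of their sparsity patterns so that each group can be assigned to a smaller, denser crossbar; the use of a minimal k-means on binary sparsity masks is an illustrative simplification of the procedures in Refs. 203 and 204.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(32, 64)) * (rng.random((32, 64)) < 0.15)  # pruned weights

masks = (W != 0).astype(float).T        # one binary sparsity pattern per column
k, iters = 4, 20                        # number of column clusters (small crossbars)

# Minimal k-means on the column sparsity patterns.
centers = masks[rng.choice(len(masks), k, replace=False)]
for _ in range(iters):
    dist = ((masks[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = np.argmin(dist, axis=1)
    centers = np.array([masks[labels == c].mean(0) if np.any(labels == c)
                        else centers[c] for c in range(k)])

for c in range(k):
    cols = np.flatnonzero(labels == c)
    if cols.size == 0:
        continue
    rows = np.flatnonzero(np.any(W[:, cols] != 0, axis=1))
    block = W[np.ix_(rows, cols)]       # dense block mapped to a small crossbar
    print(f"block {c}: {block.shape}, density "
          f"{np.count_nonzero(block) / block.size:.2f}")
```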
For CNNs, an alternative to weight pruning is filter pruning, where entire convolutional filters and their associated feature maps and filters in the subsequent layer are removed.205 This method has been shown to compress large image recognition networks by more than 30% with insignificant accuracy loss and is amenable to a crossbar implementation as it implies only the removal of entire rows and columns.

C. Hardware robustness to noise, drift, and device failures

Noise and cycle-to-cycle variability in the readout currents of the individual memory elements can cause random errors in a VMM computation, while endurance failures, manufacturing defects, and programming errors can lead to persistent errors.164 The nature and severity of these issues is technology-dependent: ReRAM devices, for example, suffer from random telegraph noise and can become stuck at high or low resistance after sufficiently many write cycles.

Reference 164 proposes an arithmetic error-correcting code for resistive crossbars. Unlike conventional error correcting codes that detect and correct individual bit flips, arithmetic codes correct for additive syndromes, which are more likely to arise in a crossbar where multiplications and summations are carried out on analog signals. Figure 22 shows the proposed error correction unit as an added component inside an ISAAC IMA.68 The first divide/residual unit decodes the VMM output and calculates a residual that can be used to index a correction table; the overhead of this table is minimized by considering only errors that are highly probable and highly significant (i.e., affecting only devices holding the MSBs of the weight). The second divide/residual unit detects but does not correct any errors that still remain. This error correction scheme was shown to substantially reduce the misclassification rate on large machine vision tasks (5% on ImageNet using AlexNet) with only a small area, power, and latency overhead.

FIG. 22. Error correction unit described in Ref. 164, shown as an added component to the ISAAC accelerator.68 Reproduced with permission from Feinberg et al., in IEEE International Symposium on High Performance Computer Architecture (HPCA) (2018). Copyright 2018 IEEE.164

Resilience to device-level errors can also be provided through redundancy. Reference 88 considers duplicating the same weight multiple times within a crossbar to average out the effects of variability and noise at the column output. Reference 206 addresses stuck-at faults by providing redundant crossbars and redundant columns within a crossbar. Reference 207 represents a weight using the total conductance of multiple PCM devices and cycles among the devices during programming; while this only linearly increases the available number of conductance levels, it also linearly reduces the endurance requirements on each device.

To combat conductance drift, Ref. 140 monitors and periodically re-adjusts the memristor conductances; this approach introduces significant energy overhead to an inference accelerator and requires the movement of data from a host CPU that stores a redundant copy of the weights. If the drift behavior of the average device is known in advance, its effect has been shown to be partially compensated by applying an appropriate slope correction schedule to the activation functions, making them steeper over time. The effectiveness of the slope correction technique decreases as the device-to-device variability around the average case increases.208

Errors that are likely to arise from process variations or defects can be suppressed by an appropriate re-mapping of the neural network weights to the hardware. Reference 209 proposes a scheme to assign the rows of the weight matrix to the crossbar rows in a way that minimizes the expected deviations on the column outputs due to cycle-to-cycle variability. Then, during a re-training phase, weights that remain particularly prone to errors are frozen and reduced to zero. Reference 210 uses a similar strategy of re-assigning matrix rows to crossbar rows to mitigate errors due to stuck-at-0 and stuck-at-1 faults, whose distribution is profiled periodically during online training. Both approaches rely on an optimization algorithm to find the best mapping.

D. Device-aware neural network training

The impact of device non-idealities on inference performance can often be compensated by anticipating their effects during the off-chip training of the neural network. Reference 211 found that injection of Gaussian noise into the synaptic weights during training can compensate for a proportional degree of read stochasticity during inference. Reference 212, which focuses on charge-trap memory, showed that noise injection during training can also be employed to improve resilience to randomly variable rates of weight relaxation due to charge leakage. References 183 and 213 have meanwhile found that a device-realistic noise model considerably outperforms more standard forms of noise (such as uniform or Gaussian) when applied to the VMM outputs during training. Reference 213 further notes that noise robustness can be increased by mapping the dynamic range of the memory devices to a weight range that is smaller than that used by a given layer; weights lying outside of this range are clipped. This reduces the amount of algorithmic noise introduced by a given magnitude of physical noise but incurs an accuracy penalty due to clipping. The optimal clipping threshold can be learned for each layer during training, along with the weights.
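The sketch below (Python/NumPy) shows the flavor of this noise-aware training: Gaussian noise is injected on the analog dot products during the forward pass, and weights are clipped to a reduced range before being mapped to conductances. The noise magnitude, clip threshold, and single-layer regression setup are illustrative assumptions rather than the specific models of Refs. 211–213.

```python
import numpy as np

rng = np.random.default_rng(5)
CLIP, SIGMA = 0.8, 0.05          # assumed clipping range and relative read noise

def noisy_clipped_forward(x, W):
    W_mapped = np.clip(W, -CLIP, CLIP)   # range actually mapped to conductances
    y = x @ W_mapped
    # Additive Gaussian noise stands in for read noise / IR drop on the VMM output.
    return y + SIGMA * np.abs(y).max() * rng.standard_normal(y.shape)

# Toy regression layer trained through the same noisy, clipped model that will
# be used at inference time, so the network learns around the non-idealities.
W = rng.normal(0, 0.5, size=(16, 4))
x, target = rng.normal(size=(32, 16)), rng.normal(size=(32, 4))
for _ in range(200):
    y = noisy_clipped_forward(x, W)
    grad = x.T @ (y - target) / len(x)   # mean-squared-error gradient
    grad *= np.abs(W) < CLIP             # clipped weights receive no update
    W -= 0.05 * grad
print(np.mean((noisy_clipped_forward(x, W) - target) ** 2))
```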
Several works have investigated re-designing the topology of the neural network to better fit the characteristics of imperfect devices: in particular, equivalent performance may be possible with better energy efficiency using simpler networks. Reference 214 shows that in the presence of noise and device nonlinearity, replacing the first layer of learned weights with fixed, random weights (and a simpler learning rule) can achieve equivalent or superior generalization performance to MLPs trained through backpropagation. A layer of fixed random weights can potentially also be used during backpropagation, in place of the original weight matrix W in Eq. (2), which would reduce the number of crossbars that need to receive regular weight updates.215 Early work suggests this may work in hardware neural network contexts,216 but more extensive analysis will be required to test whether this approach is robust to noise and process variations.

More accurate and robust device-aware training frameworks can be built from measurements on devices that are exposed to a large number of incremental update pulses. These measurements can be collected into a device-specific lookup table that provides a probability distribution function for the update ΔG for a given initial conductance state G and queried update ΔG_ideal.62 In principle, this method can be used to model effects that are not easily captured by analytic or circuit models and, in some cases, not explicitly identified at all. The effects of nonlinearity and device-to-device variability on training have been simulated using these lookup tables.62,85,217 Open-source software tools such as CrossSim218 and NeuroSim219 have been developed, which allow the integration of device models to perform array-level neural network inference and training simulations. CrossSim supports the use of experimentally derived probabilistic lookup tables, while NeuroSim uses device behavioral models with parameters extracted from experiments.
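A minimal sketch of how such a lookup table might be queried during a training simulation is shown below (Python/NumPy); the table here is filled from a synthetic nonlinear device model and is purely illustrative of the interface, not the data format used by CrossSim or NeuroSim.

```python
import numpy as np

rng = np.random.default_rng(6)
G_BINS = np.linspace(0.0, 1.0, 33)      # binned initial conductance states

# Synthetic lookup table: for each conductance bin, store samples of the
# realized update dG in response to one unit programming pulse.
# (In practice these samples would come from repeated pulsing measurements.)
table = [0.05 * (1.0 - g) + 0.01 * rng.standard_normal(500)   # nonlinear, noisy
         for g in 0.5 * (G_BINS[:-1] + G_BINS[1:])]

def apply_update(G, dG_ideal, pulse_size=0.05):
    """Realize a requested update by drawing device responses from the table."""
    n_pulses = int(round(abs(dG_ideal) / pulse_size))
    for _ in range(n_pulses):
        b = min(max(np.searchsorted(G_BINS, G) - 1, 0), len(table) - 1)
        dG_sample = np.sign(dG_ideal) * rng.choice(table[b])
        G = float(np.clip(G + dG_sample, 0.0, 1.0))
    return G

print(apply_update(0.2, +0.3))   # realized state reflects nonlinearity and noise
```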
IX. CONCLUSION

By performing computations locally within memory, the analog neuromorphic accelerator relieves part of the data transfer bottleneck that exists in any digital processor. However, these systems—which are generally mixed-signal rather than fully analog—are not without their own unique challenges. Without a careful design of the system architecture, the intrinsic advantages of crossbar computation can be swamped by the overhead (in power, area, and latency) incurred by the peripheral circuitry that supports it. A main goal in designing these architectures is to keep this overhead to a minimum without sacrificing performance.

The greatest energy and area penalty that is paid in an analog accelerator is the conversion between the analog and digital domains. Analog signals are needed to perform crossbar computations, while digital is needed for internal and external routing. Different architectures place the analog/digital divide at different points in the system: some remain predominantly digital by using bit-sliced inputs and weights, some are predominantly analog and preserve the digital domain solely for routing, while others are almost fully analog, with digitization only at the interface with off-chip digital processing. Some encode information in the time domain to escape the limitations of analog and avoid the overhead of a conventional ADC—even in these architectures, however, the cost of signal conversion has to be paid.

A primary trade-off that needs to be made in the design of an inference or training accelerator is that between energy efficiency and precision. Full digital (or software-equivalent) precision is possible in an analog system but comes at the cost of greater ADC and/or communication overhead due to bit slicing. Nonetheless, there are various ways to reduce this overhead, such as by encoding the weights to reduce ADC precision or by an optimal spatial arrangement of crossbars and their peripheral circuitry to exploit data locality. In accelerators that do not keep full precision, the main design challenge is to

choose the correct quantization resolution while maintaining high neural network accuracy.

Training accelerators demand more precision, and extracting the requisite degree of precision while exploiting crossbar parallelism requires sophisticated methods of mapping real-valued weights to a set of limited-accuracy emerging memory devices. Handling batch training is another major challenge. In analog accelerators, it is the batch-1 update that makes the most use of crossbar-level parallelism, but as in digital hardware, it breaks the architecture-level parallelism that is enabled by pipelining. This unhappy trade-off suggests a pressing need for batch-capable learning schemes that can also use pipelined updates.

Analog accelerators also face unique challenges that arise from the crossbar geometry and from the individual memory devices that constitute the crossbar. Non-ideal physical properties of the devices can be compensated using architecture-level techniques, such as error correction, periodic carry, occasional reset, and the Tiki-Taka method. Another general approach that is gaining popularity is to learn around these non-idealities by designing the neural network training algorithm with the specific devices or array/circuit effects in mind. This powerful methodology is enabled and accelerated by open-source, physically realistic crossbar modeling tools.

Two key stumbling blocks must be overcome before the full benefits of analog acceleration can be realized: (1) peripheral circuit energy consumption, area, and latency must be brought to parity or below the costs for the crossbar cores, and (2) high neuromorphic performance must be realized with more efficient co-design or re-design of modern emerging device technologies. To a frustrating extent, one or the other of these goals are often achieved in modern academic demonstrations or accelerator designs, but not both. Thankfully, as device engineers converge toward nanoscale ideal programmable resistors, a cascading set of benefits will occur throughout the design hierarchy, fulfilling (2). Meanwhile, lower device conductance and even steeper access devices can potentially increase the maximum feasible crossbar size, which can assist with (1). Finally, a co-mingling of research efforts across the design hierarchy may suggest entirely new designs that leap over both of these barriers. By better highlighting the system-level issues that break the implementation of traditional workhorse artificial intelligence algorithms, we hope to have inspired efforts in this direction.

ACKNOWLEDGMENTS

T. P. Xiao, C. H. Bennett, B. Feinberg, S. Agarwal, and M. J. Marinella acknowledge support from Sandia's Laboratory-Directed Research and Development program. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under Contract No. DE-NA0003525.

DATA AVAILABILITY

The data that support the findings of this study are available within the article.

REFERENCES
1. N. P. Jouppi, C. Young, N. Patil, and D. Patterson, "A domain-specific architecture for deep neural networks," Commun. ACM 61, 50–59 (2018).
2. A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, "Deep learning with COTS HPC systems," in Proceedings of the 30th International Conference on International Conference on Machine Learning (ICML'13) (2013), Vol. 28, pp. III-1337–III-1345.
3. V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE 105, 2295–2329 (2017).
4. A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," in IEEE High Performance Extreme Computing Conference (HPEC) (2019).
5. W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," SIGARCH Comput. Archit. News 23, 20–24 (1995).
6. H. Tsai, S. Ambrogio, P. Narayanan, R. M. Shelby, and G. W. Burr, "Recent progress in analog memory-based accelerators for deep learning," J. Phys. D 51, 283001 (2018).
7. G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola, L. L. Sanches, I. Boybat, M. L. Gallo, K. Moon, J. Woo, H. Hwang, and Y. Leblebici, "Neuromorphic computing using non-volatile memory," Adv. Phys.: X 2, 89–124 (2017).
8. W. Haensch, T. Gokmen, and R. Puri, "The next generation of deep learning hardware: Analog computing," Proc. IEEE 107, 108–122 (2019).
9. J. J. Yang, D. B. Strukov, and D. R. Stewart, "Memristive devices for computing," Nat. Nanotechnol. 8, 13 (2013).
10. Z. Sun, G. Pedretti, E. Ambrosi, A. Bricalli, W. Wang, and D. Ielmini, "Solving matrix equations in one step with cross-point resistive arrays," Proc. Natl. Acad. Sci. 116, 4123–4128 (2019).
11. I. Richter, K. Pas, X. Guo, R. Patel, J. Liu, E. Ipek, and E. G. Friedman, "Memristive accelerator for extreme scale linear solvers," in Government Microcircuit Applications and Critical Technology Conference (GOMACTech) (2015).
12. M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in IEEE International Symposium on High Performance Computer Architecture (HPCA) (2016), pp. 1–13.
13. S. Kumar, J. P. Strachan, and R. S. Williams, "Chaotic dynamics in nanoscale NbO2 Mott memristors for analogue computing," Nature 548, 318–321 (2017).
14. S. Yu, "Neuro-inspired computing with emerging nonvolatile memorys," Proc. IEEE 106, 260–285 (2018).
15. S. Mittal, "A survey of ReRAM-based architectures for processing-in-memory and neural networks," Mach. Learn. Knowl. Extr. 1, 75–114 (2018).
16. Y. Zhang, Z. Wang, J. Zhu, Y. Yang, M. Rao, W. Song, Y. Zhuo, X. Zhang, M. Cui, L. Shen, R. Huang, and J. J. Yang, "Brain-inspired computing with memristors: Challenges in devices, circuits, and systems," Appl. Phys. Rev. 7, 011308 (2020).
17. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
18. A. Ng, https://www.coursera.org/learn/machine-learning for "Machine learning;" accessed 12 August 2019.
19. J. Nocedal and S. Wright, Numerical Optimization (Springer Science & Business Media, 2006).
20. L. Bottou and O. Bousquet, "The tradeoffs of large scale learning," in Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS'07) (Curran Associates, Inc., 2007), pp. 161–168.
21. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput. 9, 1735–1780 (1997).
22. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-L. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson,

B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, Smith, J. Thong, P. Y. Xiao, and D. Burger, “A reconfigurable fabric for accel-
and D. H. Yoon, “In-datacenter performance analysis of a tensor processing erating large-scale datacenter services,” IEEE Micro 35, 10–22 (2015).
43
unit,” in Proceedings of the 44th Annual International Symposium on E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T.
Computer Architecture (ISCA ’17) (ACM, New York, 2017), pp. 1–12. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, M. Abeydeera, L. Adams,
23
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning H. Angepat, C. Boehn, D. Chiou, O. Firestein, A. Forin, K. S. Gatlin, M.
applied to document recognition,” Proc. IEEE 86, 2278–2324 (1998). Ghandi, S. Heil, K. Holohan, A. E. Husseini, T. Juhasz, K. Kagi, R. K. Kovvuri,
24
Y. LeCun and C. Cortes, http://yann.lecun.com/exdb/mnist/ for “MNIST S. Lanka, F. van Megen, D. Mukhortov, P. Patel, B. Perez, A. Rapsang, S.
handwritten digit database;” accessed 7 December 2019. Reinhardt, B. Rouhani, A. Sapek, R. Seera, S. Shekar, B. Sridharan, G. Weisz,
25
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. L. Woods, P. Y. Xiao, D. Zhang, R. Zhao, and D. Burger, “Serving DNNs in
Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large real time at datacenter scale with Project Brainwave,” IEEE Micro 38, 8–20
scale visual recognition challenge,” Int. J. Comput. Vision 115, 211–252 (2018).
(2015). 44
Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N.
26
X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, “Scaling for Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in
edge inference of deep neural networks,” Nat. Electron. 1, 216–222 (2018). 47th Annual IEEE/ACM International Symposium on Microarchitecture
27
S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neu- (2014), pp. 609–622.
ral networks with pruning, trained quantization and Huffman coding,” 45
D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay,
arXiv:1510.00149 (2015). “Neurocube: A programmable digital neuromorphic architecture with high-
28
M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep density 3D memory,” in ACM/IEEE 43rd Annual International Symposium
neural networks with binary weights during propagations,” in Proceedings of on Computer Architecture (ISCA) (2016), pp. 380–392.
the 28th International Conference on Neural Information Processing Systems 46
T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao:
(NIPS’15) (MIT Press, Cambridge, MA, 2015), Vol. 2, pp. 3123–3131. A small-footprint high-throughput accelerator for ubiquitous machine-
29
M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: Imagenet learning,” in Proceedings of the 19th International Conference on Architectural
classification using binary convolutional neural networks,” arXiv:1603.05279 Support for Programming Languages and Operating Systems (ASPLOS ’14)
(2016). (ACM, New York, 2014), pp. 269–284.
30
M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized 47
Y. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-
neural networks: Training deep neural networks with weights and activations efficient dataflow for convolutional neural networks,” in ACM/IEEE 43rd
constrained to þ 1 or -1,” arXiv:1602.02830 (2016). Annual International Symposium on Computer Architecture (ISCA) (2016),
31
W. Tang, G. Hua, and L. Wang, “How to train a compact binary neural net-
pp. 367–379.
work with high accuracy?” in Proceedings of the Thirty-First AAAI Conference 48
S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally,
on Artificial Intelligence (AAAI’17) (AAAI Press, 2017), pp. 2625–2631.
32 “EIE: Efficient inference engine on compressed deep neural network,” in
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning
Proceedings of the 43rd International Symposium on Computer Architecture
with limited numerical precision,” arXiv:1502.02551 (2015).
33 (ISCA ’16) (IEEE Press, Piscataway, NJ, 2016), pp. 243–254.
A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. 49
B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M.
Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural net-
Hernandez-Lobato, G. Wei, and D. Brooks, “Minerva: Enabling low-power,
works for mobile vision applications,” arXiv:1704.04861 (2017).
34 highly-accurate deep neural network accelerators,” in ACM/IEEE 43rd
Y. Chen, T.-J. Yang, J. Emer, and V. Sze, “Understanding the limitations of existing
Annual International Symposium on Computer Architecture (ISCA) (2016),
energy-efficient design approaches for deep neural networks,” Energy 2, L3 (2018);
pp. 267–278.
available at https://scholar.google.com/scholar?cluster=9880902826603509712&hl 50
R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An ultra-low power
=en&as_sdt=0,32&sciodt=0,32.
35
R. Raina, A. Madhavan, and A. Y. Ng, “Large-scale deep unsupervised learn- convolutional neural network accelerator based on binary weights,” in IEEE
ing using graphics processors,” in Proceedings of the 26th Annual Computer Society Annual Symposium on VLSI (ISVLSI) (2016), pp.
International Conference on Machine Learning (ACM, 2009), pp. 873–880. 236–241.
51
36
S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H. Yoo, “UNPU: A 50.6 TOPS/
E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” W unified deep neural network accelerator with 1b-to-16b fully-variable
arXiv:1410.0759 (2014). weight bit-precision,” in IEEE International Solid-State Circuits Conference
37
C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “CNP: An FPGA-based pro- (ISSCC) (2018), pp. 218–220.
52
cessor for convolutional networks,” in International Conference on Field K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S.
Programmable Logic and Applications (2009), pp. 32–37. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura,
38
C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “BRein memory: A single-chip binary/ternary reconfigurable in-memory
“Neuflow: A runtime reconfigurable dataflow processor for vision,” in CVPR deep neural network accelerator achieving 1.4 TOPS at 0.6 W,” IEEE J. Solid-
Workshops (2011), pp. 109–116. State Circuits 53, 983–994 (2018).
53
39
C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA- S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A.
based accelerator design for deep convolutional neural networks,” in Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan,
Proceedings of the ACM/SIGDA International Symposium on Field- “SCALEDEEP: A scalable compute architecture for learning and evaluating
Programmable Gate Arrays (FPGA ’15) (ACM, New York, 2015), pp. deep networks,” in ACM/IEEE 44th Annual International Symposium on
161–170. Computer Architecture (ISCA) (2017), pp. 13–26.
40 54
S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen,
configurable coprocessor for convolutional neural networks,” in Proceedings “Cambricon: An instruction set architecture for neural networks,” in
of the 37th Annual International Symposium on Computer Architecture (ISCA Proceedings of the 43rd International Symposium on Computer Architecture
’10) (ACM, New York, 2010), pp. 247–257. (ISCA ’16) (IEEE Press, Piscataway, NJ, 2016), pp. 393–405.
41 55
X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A.
“Automated systolic array architecture synthesis for high throughput CNN Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory
inference on FPGAs,” in 54th ACM/EDAC/IEEE Design Automation accelerator for bulk bitwise operations using commodity DRAM technology,”
Conference (DAC) (2017), pp. 1–6. in Proceedings of the 50th Annual IEEE/ACM International Symposium on
42
A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Microarchitecture (MICRO-50 ’17) (ACM, New York, 2017), pp. 273–287.
56
Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “DRISA: A
Hauck, S. Heil, A. Hormati, J. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. DRAM-based reconfigurable in-situ accelerator,” in Proceedings of the 50th

Annual IEEE/ACM International Symposium on Microarchitecture (MICRO- data-intensive computing,” in IEEE/ACM International Conference on
50 ’17) (ACM, New York, 2017), pp. 288–301. Computer-Aided Design (ICCAD) (2018), pp. 1–8.
57 75
C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. C. Hsu, I. Wang, C. Lo, M. Chiang, W. Jang, C. Lin, and T. Hou, “Self-rectify-
Blaauw, and R. Das, “Neural Cache: Bit-serial in-cache acceleration of deep ing bipolar TaOx/TiO2 RRAM with superior endurance over 1012 cycles for
neural networks,” in Proceedings of the 45th Annual International Symposium 3D high-density storage-class memory,” in Symposium on VLSI Technology
on Computer Architecture (ISCA ’18) (IEEE Press, Piscataway, NJ, 2018), pp. (2013), pp. T166–T167.
76
383–396. S. W. Fong, C. M. Neumann, and H.-S. P. Wong, “Phase-change memory:
58
J. Zhang, Z. Wang, and N. Verma, “A machine-learning classifier imple- Towards a storage-class memory,” IEEE Trans. Electron Devices 64,
mented in a standard 6T SRAM array,” in IEEE Symposium on VLSI Circuits 4374–4385 (2017).
77
(VLSI-Circuits) (2016), pp. 1–2. E. J. Merced-Grafals, N. Davila, N. Ge, R. S. Williams, and J. P. Strachan,
59
H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, “A mixed-signal binar- “Repeatable, accurate, and high speed multi-level programming of memristor
ized convolutional-neural-network accelerator integrating dense weight stor- 1T1R arrays for power efficient analog computing applications,”
age and multiplication for reduced data movement,” in IEEE Symposium on Nanotechnology 27, 365202 (2016).
78
VLSI Circuits (IEEE, 2018), pp. 141–142. A. Sebastian, M. L. Gallo, and E. Eleftheriou, “Computational phase-change
60
S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in- memory: Beyond von Neumann computing,” J. Phys. D 52, 443002 (2019).
79
memory architecture for bulk bitwise operations in emerging non-volatile S. George, S. Kim, S. Shah, J. Hasler, M. Collins, F. Adil, R. Wunderlich, S.
memories,” in 53rd ACM/EDAC/IEEE Design Automation Conference Nease, and S. Ramakrishnan, “A programmable and configurable mixed-
(DAC) (2016), pp. 1–6. mode FPAA SoC,” IEEE Trans. Very Large Scale Integration (VLSI) Syst. 24,
61
X. Yin, X. Chen, M. Niemier, and X. S. Hu, “Ferroelectric FETs-based nonvol- 2253–2261 (2016).
80
atile logic-in-memory circuits,” IEEE Trans. Very Large Scale Integration F. M. Bayat, X. Guo, H. A. Om’mani, N. Do, K. K. Likharev, and D. B.
(VLSI) Syst. 27, 159–172 (2019). Strukov, “Redesigning commercial floating-gate memory for analog comput-
62
M. J. Marinella, S. Agarwal, A. Hsia, I. Richter, R. Jacobs-Gedrim, J. Niroula, ing applications,” in IEEE International Symposium on Circuits and Systems
S. J. Plimpton, E. Ipek, and C. D. James, “Multiscale co-design analysis of (ISCAS) (2015), pp. 1921–1924.
81
energy, latency, area, and accuracy of a ReRAM analog neural training accel- Y. van de Burgt, E. Lubberman, E. J. Fuller, S. T. Keene, G. C. Faria, S.
erator,” IEEE J. Emerging Sel. Top. Circuits Syst. 8, 86–101 (2018). Agarwal, M. J. Marinella, A. A. Talin, and A. Salleo, “A non-volatile organic
63
S. Agarwal, T.-T. Quach, O. Parekh, A. H. Hsia, E. P. DeBenedictis, C. D. electrochemical device as a low-voltage artificial synapse for neuromorphic
James, M. J. Marinella, and J. B. Aimone, “Energy scaling advantages of resis- computing,” Nat. Mater. 16, 414 (2017).
82
tive memory crossbar based computation and its application to sparse M. Jerry, P. Chen, J. Zhang, P. Sharma, K. Ni, S. Yu, and S. Datta,
coding,” Front. Neurosci. 9, 484 (2016). “Ferroelectric FET analog synapse for acceleration of deep neural network
64
B. Li, P. Gu, Y. Shan, Y. Wang, Y. Chen, and H. Yang, “RRAM-based analog training,” in IEEE International Electron Devices Meeting (IEDM) (2017), pp.
approximate computing,” IEEE Trans. Comput.-Aided Des. Integr. Circuits 6.2.1–6.2.4.
83
Syst. 34, 1905–1917 (2015). Y. Wu, B. Lee, and H. P. Wong, “Al2O3-based RRAM using atomic layer
65
M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. deposition (ALD) with 1-lA reset current,” IEEE Electron Device Lett. 31,
Ge, J. J. Yang, and R. S. Williams, “Dot-product engine for neuromorphic 1449–1451 (2010).
84
computing: Programming 1T1M crossbar to accelerate matrix-vector multi- W.-S. Khwa, D. Lu, C.-M. Dou, and M.-F. Chang, “Emerging NVM circuit
plication,” in Proceedings of the 53rd Annual Design Automation Conference techniques and implementations for energy-efficient systems,” in Beyond-
(ACM, 2016), p. 19. CMOS Technologies for Next Generation Computer Design (Springer, 2019),
66
T. Gokmen, M. J. Rasch, and W. Haensch, “Training LSTM networks with pp. 85–132.
85
resistive cross-point devices,” Front. Neurosci. 12, 745 (2018). S. Agarwal, D. Garland, J. Niroula, R. B. Jacobs-Gedrim, A. Hsia, M. S. Van
67
H. Tsai, S. Ambrogio, C. Mackin, P. Narayanan, R. Shelby, K. Rocki, A. Chen, Heukelom, E. Fuller, B. Draper, and M. J. Marinella, “Using floating-gate
and G. Burr, “Inference of long-short term memory networks at software- memory to train ideal accuracy neural networks,” IEEE J. Explor. Solid-State
equivalent accuracy using 2.5M analog phase change memory devices,” in Comput. Devices Circuits 5, 52–57 (2019).
86
Symposium on VLSI Technology (IEEE, 2019), pp. T82–T83. S. Agarwal, S. J. Plimpton, D. R. Hughart, A. H. Hsia, I. Richter, J. A. Cox, C.
68
A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, D. James, and M. J. Marinella, “Resistive memory device requirements for a
M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural net- neural algorithm accelerator,” in International Joint Conference on Neural
work accelerator with in-situ analog arithmetic in crossbars,” in Proceedings Networks (IJCNN) (2016), pp. 929–938.
87
of the 43rd International Symposium on Computer Architecture (ISCA ’16) T. Gokmen and Y. Vlasov, “Acceleration of deep neural network training
(IEEE Press, Piscataway, NJ, 2016), pp. 14–26. with resistive cross-point devices: Design considerations,” Front. Neurosci.
69
G. W. Burr, B. N. Kurdi, J. C. Scott, C. H. Lam, K. Gopalakrishnan, and R. S. 10, 333 (2016).
88
Shenoy, “Overview of candidate device technologies for storage-class memo- S. Yu, P. Chen, Y. Cao, L. Xia, Y. Wang, and H. Wu, “Scaling-up resistive syn-
ry,” IBM J. Res. Dev. 52, 449–464 (2008). aptic arrays for neuro-inspired architecture: Challenges and prospect,” in
70
C.-S. Yang, D.-S. Shang, N. Liu, E. J. Fuller, S. Agrawal, A. A. Talin, Y.-Q. Li, IEEE International Electron Devices Meeting (IEDM) (2015), pp.
B.-G. Shen, and Y. Sun, “All-solid-state synaptic transistor with ultralow con- 17.3.1–17.3.4.
89
ductance for neuromorphic computing,” Adv. Funct. Mater. 28, 1804170 H. P. Wong, H. Lee, S. Yu, Y. Chen, Y. Wu, P. Chen, B. Lee, F. T. Chen, and
(2018). M. Tsai, “Metal-oxide RRAM,” Proc. IEEE 100, 1951–1970 (2012).
71 90
T. Yang and V. Sze, “Design considerations for efficient deep neural networks J. Woo, K. Moon, J. Song, S. Lee, M. Kwak, J. Park, and H. Hwang,
on processing-in-memory accelerators,” in IEEE International Electron “Improved synaptic behavior under identical pulses using AlOx/HfO2 bilayer
Devices Meeting (IEDM) (2019), pp. 22.1.1–22.1.4. RRAM array for neuromorphic systems,” IEEE Electron Device Lett. 37,
72
E. J. Fuller, S. T. Keene, A. Melianas, Z. Wang, S. Agarwal, Y. Li, Y. Tuchman, 994–997 (2016).
91
C. D. James, M. J. Marinella, J. J. Yang, A. Salleo, and A. A. Talin, “Parallel J. Park, M. Kwak, K. Moon, J. Woo, D. Lee, and H. Hwang, “TiOx-based
programming of an ionic floating-gate memory array for scalable neuromor- RRAM synapse with 64-levels of conductance and symmetric conductance
phic computing,” Science 364, 570–574 (2019). change by adopting a hybrid pulse scheme for neuromorphic computing,”
73
A. Chen, “A review of emerging non-volatile memory (NVM) technologies IEEE Electron Device Lett. 37, 1559–1562 (2016).
92
and applications,” Solid-State Electron. 125, 25–38 (2016), extended papers K. Moon, A. Fumarola, S. Sidler, J. Jang, P. Narayanan, R. M. Shelby, G. W.
selected from ESSDERC 2015. Burr, and H. Hwang, “Bidirectional non-filamentary RRAM as an analog neu-
74
Y. Long, T. Na, P. Rastogi, K. Rao, A. I. Khan, S. Yalamanchili, and S. romorphic synapse. Part I: Al/Mo/Pr0.7Ca0.3MnO3 material improvements
Mukhopadhyay, “A Ferroelectric FET based power-efficient architecture for and device measurements,” IEEE J. Electron Devices Soc. 6, 146–155 (2018).

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-30


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

93 114
I.-T. Wang, C.-C. Chang, L.-W. Chiu, T. Chou, and T.-H. Hou, “3D Ta/ L. Fick, D. Blaauw, D. Sylvester, S. Skrzyniarz, M. Parikh, and D. Fick, “Analog
TaOx/TiO2/Ti synaptic array and linearity tuning of weight update for hard- in-memory subthreshold deep neural network accelerator,” in IEEE Custom
ware neural network applications,” Nanotechnology 27, 365204 (2016). Integrated Circuits Conference (CICC) (2017), pp. 1–4.
94 115
B. Chakrabarti, M. A. Lastras-Monta~ no, G. Adam, M. Prezioso, B. Hoskins, F. Merrikh-Bayat, X. Guo, M. Klachko, M. Prezioso, K. K. Likharev, and D. B.
M. Payvand, A. Madhavan, A. Ghofrani, L. Theogarajan, K.-T. Cheng et al., Strukov, “High-performance mixed-signal neurocomputing with nanoscale
“A multiply-add engine with monolithically integrated 3D memristor cross- floating-gate memory cell arrays,” IEEE Trans. Neural Networks Learn. Syst.
bar/CMOS hybrid circuit,” Sci. Rep. 7, 42429 (2017). 29, 4782–4790 (2018).
95 116
G. C. Adam, B. D. Hoskins, M. Prezioso, F. Merrikh-Bayat, B. Chakrabarti, J. Hasler and H. Marr, “Finding a roadmap to achieve large neuromorphic
and D. B. Strukov, “3-D memristor crossbars for analog and neuromorphic hardware systems,” Front. Neurosci. 7, 118 (2013).
computing applications,” IEEE Trans. Electron Devices 64, 312–318 (2017). 117
M. Bavandpour, S. Sahay, M. R. Mahmoodi, and D. B. Strukov, “3D-aCortex:
96
Z. Li, P. Chen, H. Xu, and S. Yu, “Design of ternary neural network with 3-D An ultra-compact energy-efficient neurocomputing platform based on com-
vertical RRAM array,” IEEE Trans. Electron Devices 64, 2721–2727 (2017). mercial 3D-NAND flash memories,” arXiv:1908.02472 (2019).
97 118
G. W. Burr, M. J. Brightsky, A. Sebastian, H.-Y. Cheng, J.-Y. Wu, S. Kim, N. B. E. Boser, E. Sackinger, J. Bromley, Y. L. Cun, and L. D. Jackel, “An analog
E. Sosa, N. Papandreou, H.-L. Lung, H. Pozidis et al., “Recent progress in neural network processor with programmable topology,” IEEE J. Solid-State
phase-change memory technology,” IEEE J. Emerging Sel. Top. Circuits Syst. Circuits 26, 2017–2025 (1991).
6, 146–162 (2016). 119
R. Genov and G. Cauwenberghs, “Charge-mode parallel architecture for
98
E. J. Fuller, F. E. Gabaly, F. Leonard, S. Agarwal, S. J. Plimpton, R. B. Jacobs- vector-matrix multiplication,” IEEE Trans. Circuits Syst. II 48, 930–936
Gedrim, C. D. James, M. J. Marinella, and A. A. Talin, “Li-ion synaptic tran- (2001).
sistor for low power analog computing,” Adv. Mater. 29, 1604310 (2017). 120
F. J. Kub, K. K. Moon, I. A. Mack, and F. M. Long, “Programmable analog
99
E. J. Fuller, Y. Li, C. Bennet, S. T. Keene, A. Melianas, S. Agarwal, M. J. vector-matrix multipliers,” IEEE J. Solid-State Circuits 25, 207–214 (1990).
Marinella, A. Salleo, and A. A. Talin, “Redox transistors for neuromorphic 121
S. Kim, T. Gokmen, H.-M. Lee, and W. E. Haensch, “Analog CMOS-based resis-
computing,” IBM J. Res. Develop. 63(9), 1–9:9 (2019). tive processing unit for deep neural network training,” IEEE 60th International
100
H. Mulaosmanovic, J. Ocker, S. M€ uller, M. Noack, J. M€uller, P. Polakowski, T. Midwest Symposium on Circuits and Systems (MWSCAS) (2017).
Mikolajick, and S. Slesazeck, “Novel ferroelectric FET based synapse for neu- 122
D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann, “An always-
romorphic systems,” in Symposium on VLSI Technology (2017), pp. on 3.8 lJ/86% CIFAR-10 mixed-signal binary CNN processor with all mem-
T176–T177. ory on chip in 28-nm CMOS,” IEEE J. Solid-State Circuits 54, 158–172 (2019).
101
Y. Long, D. Kim, E. Lee, P. Saha, B. A. Mudassar, X. She, A. I. Khan, and S. 123
E. H. Lee and S. S. Wong, “24.2A 2.5GHz 7.7TOPS/W switched-capacitor
Mukhopadhyay, “A ferroelectric FET based processing-in-memory architec- matrix multiplier with co-designed local memory in 40 nm,” in IEEE
ture for DNN acceleration,” IEEE J. Explor. Solid-State Comput. Devices
International Solid-State Circuits Conference (ISSCC) (2016), pp. 418–419.
Circuits 5, 113–111 (2019). 124
Q. Luo, X. Xu, H. Liu, H. Lv, T. Gong, S. Long, Q. Liu, H. Sun, W. Banerjee, L.
102
B. Obradovic, T. Rakshit, R. Hatcher, J. Kittl, R. Sengupta, J. G. Hong, and M.
Li et al., “Super non-linear RRAM with ultra-low power for 3D vertical nano-
S. Rodder, “A multi-bit neuromorphic weight cell using ferroelectric FETs,
crossbar arrays,” Nanoscale 8, 15629–15636 (2016).
suitable for SoC integration,” IEEE J. Electron Devices Soc. 6, 438–448 (2018). 125
103 R. Midya, Z. Wang, J. Zhang, S. E. Savel’ev, C. Li, M. Rao, M. H. Jang, S. Joshi,
T. P. Ma and J.-P. Han, “Why is nonvolatile ferroelectric memory field-effect
H. Jiang, P. Lin et al., “Anatomy of Ag/Hafnia-based selectors with 1010 non-
transistor still elusive?,” IEEE Electron Device Lett. 23, 386–388 (2002).
104 linearity,” Adv. Mater. 29, 1604457 (2017).
S. Lequeux, J. Sampaio, V. Cros, K. Yakushiji, A. Fukushima, R. Matsumoto, 126
G. W. Burr, R. S. Shenoy, K. Virwani, P. Narayanan, A. Padilla, B. Kurdi, and
H. Kubota, S. Yuasa, and J. Grollier, “A magnetic synapse: Multilevel spin-
H. Hwang, “Access devices for 3D crosspoint memory,” J. Vac. Sci. Technol. B
torque memristor with perpendicular anisotropy,” Sci. Rep. 6, 1–7 (2016).
105 32, 040802 (2014).
C. R. Schlottmann and P. E. Hasler, “A highly dense, low power, programma- 127
M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. C. Adam, K. K. Likharev, and
ble analog vector-matrix multiplier: The FPAA implementation,” IEEE J.
D. B. Strukov, “Training and operation of an integrated neuromorphic net-
Emerging Sel. Top. Circuits Syst. 1, 403–411 (2011).
106
X. Guo, F. M. Bayat, M. Bavandpour, M. Klachko, M. R. Mahmoodi, M. work based on metal-oxide memristors,” Nature 521, 61 (2015).
128
Prezioso, K. K. Likharev, and D. B. Strukov, “Fast, energy-efficient, robust, C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery, P. Lin,
and reproducible mixed-signal neuromorphic classifier based on embedded Z. Wang et al., “Efficient and self-adaptive in-situ learning in multilayer mem-
NOR flash memory technology,” in IEEE International Electron Devices ristor neural networks,” Nat. Commun. 9, 2385 (2018).
129
Meeting (IEDM) (2017), pp. 6.5.1–6.5.4. M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang,
107
M. R. Mahmoodi and D. B. Strukov, “Mixed-signal POp/j computing with R. S. Williams, J. J. Yang, Q. Xia, and J. P. Strachan, “Memristor-based analog
nonvolatile memories,” in Proceedings of the Great Lakes Symposium on VLSI computation and neural network classification with a dot product engine,”
(GLSVLSI ’18) (ACM, New York, 2018), pp. 513–513. Adv. Mater. 30, 1705914 (2018).
130
108
C. Diorio, P. Hasler, A. Minch, and C. A. Mead, “A single-transistor silicon F. M. Bayat, M. Prezioso, B. Chakrabarti, H. Nili, I. Kataeva, and D. Strukov,
synapse,” IEEE Trans. Electron Devices 43, 1972–1980 (1996). “Implementation of multilayer perceptron network with highly uniform pas-
109
P. C. Y. Chen, “Threshold-alterable Si-gate MOS devices,” IEEE Trans. sive memristive crossbar circuits,” Nat. Commun. 9, 2331 (2018).
131
Electron Devices 24, 584–586 (1977). F. Cai, J. M. Correll, S. H. Lee, Y. Lim, V. Bothra, Z. Zhang, M. P. Flynn, and
110
Y. J. Park, H. T. Kwon, B. Kim, W. J. Lee, D. H. Wee, H. Choi, B. Park, J. Lee, W. D. Lu, “A fully integrated reprogrammable memristor–CMOS system for
and Y. Kim, “3-D stacked synapse array based on charge-trap flash memory efficient multiply–accumulate operations,” Nat. Electron. 2, 290–299 (2019).
132
for implementation of deep neural networks,” IEEE Trans. Electron Devices P. Yao, H. Wu, B. Gao, S. B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N.
66, 420–427 (2019). Deng, L. Shi, H.-S. P. Wong et al., “Face classification using electronic syn-
111 apses,” Nat. Commun. 8, 15199 (2017).
P. Wang, F. Xu, B. Wang, B. Gao, H. Wu, H. Qian, and S. Yu, “Three-dimen-
133
sional NAND flash for vector-matrix multiplication,” IEEE Trans. Very Large S. Yu, Z. Li, P. Chen, H. Wu, B. Gao, D. Wang, W. Wu, and H. Qian, “Binary
Scale Integration (VLSI) Syst. 27, 988–991 (2019). neural network with 16 Mb RRAM macro chip for classification and online
112 training,” in IEEE International Electron Devices Meeting (IEDM) (2016), pp.
R. Chawla, A. Bandyopadhyay, V. Srinivasan, and P. Hasler, “A 531 nW/MHz,
128  32 current-mode programmable analog vector-matrix multiplier with 16.2.1–16.2.4.
134
over two decades of linearity,” in Proceedings of the IEEE Custom Integrated G. W. Burr, R. M. Shelby, S. Sidler, C. di Nolfo, J. Jang, I. Boybat, R. S. Shenoy,
Circuits Conference, Cat. No. 04CH37571 (IEEE, 2004), pp. 651–654. P. Narayanan, K. Virwani, E. U. Giacometti, B. N. Kurdi, and H. Hwang,
113
S. Ramakrishnan and J. Hasler, “Vector-matrix multiply and winner-take-all as “Experimental demonstration and tolerancing of a large-scale neural network
an analog classifier,” IEEE Trans. Very Large Scale Integration (VLSI) Syst. 22, (165 000 synapses) using phase-change memory as the synaptic weight ele-
353–361 (2014). ment,” IEEE Trans. Electron Devices 62, 3498–3507 (2015).

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-31


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

135 153
R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong, “RedEye: Analog A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee, J. Strachan,
convnet image sensor architecture for continuous mobile vision,” in ACM and N. Muralimanohar, “Newton: Gravitating towards the physical limits of
SIGARCH Computer Architecture News (IEEE Press, 2016), Vol. 44, pp. crossbar acceleration,” IEEE Micro 38, 41–49 (2018).
154
255–266. M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, “Analysis of power con-
136
P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: sumption and linearity in capacitive digital-to-analog converters used in suc-
A novel processing-in-memory architecture for neural network computation cessive approximation ADCs,” IEEE Trans. Circuits Syst. I 58, 1736–1748
in ReRAM-based main memory,” in ACM/IEEE 43rd Annual International (2011).
Symposium on Computer Architecture (ISCA) (2016), pp. 27–39. 155
M. Giordano, G. Cristiano, K. Ishibashi, S. Ambrogio, H. Tsai, G. W. Burr,
137
M. Bavandpour, M. R. Mahmoodi, and D. B. Strukov, “Energy-efficient time- and P. Narayanan, “Analog-to-digital conversion with reconfigurable function
domain vector-by-matrix multiplier for neurocomputing and beyond,” IEEE mapping for neural networks activation function acceleration,” IEEE J.
Trans. Circuits Syst. II 66, 1512–1516 (2019). Emerging Sel. Top. Circuits Syst. 9, 367–376 (2019).
138 156
J. Balfour and W. J. Dally, “Design tradeoffs for tiled cmp on-chip networks,” M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, and H. Yang, “Time: A
in ACM International Conference on Supercomputing 25th Anniversary training-in-memory architecture for memristor-based deep neural networks,”
Volume (2006), pp. 390–401. in 54th ACM/EDAC/IEEE Design Automation Conference (DAC) (2017), pp.
139
X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. 1–6.
Wu, and J. Yang, “Reno: A high-efficient reconfigurable neuromorphic com- 157
X. Sun, S. Yin, X. Peng, R. Liu, J. Seo, and S. Yu, “XNOR-RRAM: A scalable
puting accelerator design,” in 52nd ACM/EDAC/IEEE Design Automation and parallel resistive synaptic architecture for binary neural networks,” in
Conference (DAC) (2015), pp. 1–6. Design, Automation Test in Europe Conference Exhibition (DATE) (2018),
140
X. Liu, M. Mao, B. Liu, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, J. Yang, pp. 1423–1428.
H. Li, and Y. Chen, “Harmonica: A framework of heterogeneous computing 158
L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, and H. Yang,
systems with memristor-based neuromorphic computing accelerators,” IEEE “Switched by input: Power efficient structure for RRAM-based convolutional
Trans. Circuits Syst. I 63, 617–628 (2016). neural network,” in 53nd ACM/EDAC/IEEE Design Automation Conference
141
S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. di Nolfo, S. (DAC) (2016), pp. 1–6.
Sidler, M. Giordano, M. Bodini, N. C. Farinha et al., “Equivalent-accuracy 159
M. Santos, N. Horta, and J. Guilherme, “A survey on nonlinear analog-to-digi-
accelerated neural-network training using analogue memory,” Nature 558, 60 tal converters,” Integr. VLSI J. 47, 12–22 (2014).
(2018). 160
E. Rosenthal, S. Greshnikov, D. Soudry, and S. Kvatinsky, “A fully analog
142
M. Hu, H. Li, Q. Wu, and G. S. Rose, “Hardware realization of BSB recall func- memristor-based neural network with online gradient training,” in IEEE
tion using memristor crossbar arrays,” in DAC Design Automation
International Symposium on Circuits and Systems (ISCAS) (2016), pp.
Conference (2012), pp. 498–503.
143 1394–1397.
M. R. Mahmoodi and D. Strukov, “An ultra-low energy internally analog, 161
G. Khodabandehloo, M. Mirhassani, and M. Ahmadi, “Analog implementation
externally digital vector-matrix multiplier based on NOR flash memory tech-
of a novel resistive-type sigmoidal neuron,” IEEE Trans. Very Large Scale
nology,” in 55th ACM/ESDA/IEEE Design Automation Conference (DAC)
Integration (VLSI) Syst. 20, 750–754 (2012).
(2018), pp. 1–6. 162
144 F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High precision tuning of
D. Soudry, D. D. Castro, A. Gal, A. Kolodny, and S. Kvatinsky, “Memristor-
state for memristive devices by adaptable variation-tolerant algorithm,”
based multilayer neural networks with online gradient descent training,” IEEE
Nanotechnology 23, 075201 (2012).
Trans. Neural Networks Learn. Syst. 26, 2408–2421 (2015). 163
145 B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek,
P. Narayanan, L. L. Sanches, A. Fumarola, R. M. Shelby, S. Ambrogio, J. Jang,
“Enabling scientific computing on memristive accelerators,” in ACM/IEEE
H. Hwang, Y. Leblebici, and G. W. Burr, “Reducing circuit design complexity
45th Annual International Symposium on Computer Architecture (ISCA)
for neuromorphic machine learning systems based on non-volatile memory
arrays,” in IEEE International Symposium on Circuits and Systems (ISCAS) (2018), pp. 367–382.
164
(2017), pp. 1–4. B. Feinberg, S. Wang, and E. Ipek, “Making memristive neural network accel-
146
M. Bavandpour, S. Sahay, M. R. Mahmoodi, and D. Strukov, “Efficient mixed- erators reliable,” in IEEE International Symposium on High Performance
signal neurocomputing via successive integration and rescaling,” IEEE Trans. Computer Architecture (HPCA) (2018), pp. 52–65.
165
Very Large Scale Integration (VLSI) Syst. 28, 823–827 (2020). Y. Kim, H. Kim, D. Ahn, and J.-J. Kim, “Input-splitting of large neural net-
147
B. Gordon, “Linear electronic analog/digital conversion architectures, their works for power-efficient accelerator with resistive crossbar memory array,” in
origins, parameters, limitations, and applications,” IEEE Trans. Circuits Syst. Proceedings of the International Symposium on Low Power Electronics and
25, 391–418 (1978). Design (ISLPED ’18) (Association for Computing Machinery, New York,
148
A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. 2018).
166
Faraboschi, W.-M. W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, L. Ni, H. Huang, Z. Liu, R. V. Joshi, and H. Yu, “Distributed in-memory com-
“PUMA: A programmable ultra-efficient memristor-based accelerator for puting on binary RRAM crossbar,” J. Emerging Technol. Comput. Syst. 13,
machine learning inference,” in Proceedings of the Twenty-Fourth 1–36:18 (2017).
167
International Conference on Architectural Support for Programming Languages S. Yin, Y. Kim, X. Han, H. Barnaby, S. Yu, Y. Luo, W. He, X. Sun, J. Kim, and
and Operating Systems (ASPLOS ’19) (ACM, New York, 2019), pp. 715–731. J. Seo, “Monolithically Integrated RRAM- and CMOS-based in-memory com-
149 puting optimizations for efficient deep learning,” EEE Micro 39, 54–63 (2019).
R. Genov and G. Cauwenberghs, “Kerneltron: Support vector “machine” in sil-
168
icon,” IEEE Trans. Neural Networks 14, 1426–1434 (2003). T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, “Binary convolutional neural
150 network on RRAM,” in 22nd Asia and South Pacific Design Automation
J. Hasler, “Analog architecture complexity theory empowering ultra-low
power configurable analog and mixed mode soc systems,” J. Low Power Conference (ASP-DAC) (2017), pp. 782–787.
169
Electron. Appl. 9, 4 (2019). L. Ni, Z. Liu, H. Yu, and R. V. Joshi, “An energy-efficient digital ReRAM-
151 crossbar-based CNN with bitwise parallelism,” IEEE J. Explor. Solid-State
P. Figueiredo, “Recent advances and trends in high-performance embedded
data converters,” in High-Performance AD and DA Converters, IC Design in Comput. Devices Circuits 3, 37–46 (2017).
170
Scaled Technologies, and Time-Domain Signal Processing (Springer, 2015), pp. E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr,
85–142. “Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU,
152
L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, and ASIC,” in International Conference on Field-Programmable Technology
H. Yueksel, A. Cevrero, I. Ozkaya et al., “28.5 A 10b 1.5 GS/s pipelined-SAR (FPT) (2016), pp. 77–84.
171
ADC with background second-stage common-mode regulation and offset cali- F. Alibart, E. Zamanidoost, and D. B. Strukov, “Pattern classification by mem-
bration in 14 nm CMOS FinFET,” in IEEE International Solid-State Circuits ristive crossbar circuits using ex situ and in situ training,” Nat. Commun. 4,
Conference (ISSCC) (IEEE, 2017), pp. 474–475. 2072 (2013).

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-32


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

172 193
S. N. Truong and K.-S. Min, “New memristor-based crossbar array architec- M. Suri, O. Bichler, D. Querlioz, O. Cueto, L. Perniola, V. Sousa, D.
ture with 50-% area reduction and 48-% power saving for matrix-vector multi- Vuillaume, C. Gamrat, and B. DeSalvo, “Phase change memory as synapse for
plication of analog neuromorphic computing,” J. Semicond. Technol. Sci. 14, ultra-dense neuromorphic systems: Application to complex visual pattern
356–363 (2014). extraction,” in International Electron Devices Meeting (2011), pp. 4.4.1–4.4.4.
173 194
Y. Zhang, X. Wang, and E. G. Friedman, “Memristor-based circuit design for O. Bichler, M. Suri, D. Querlioz, D. Vuillaume, B. DeSalvo, and C. Gamrat,
multilayer neural networks,” IEEE Trans. Circuits Syst. I 65, 677–686 (2018). “Visual pattern extraction using energy-efficient “2-PCM synapse” neuromor-
174
L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A pipelined reram-based phic architecture,” IEEE Trans. Electron Devices 59, 2206–2214 (2012).
195
accelerator for deep learning,” in IEEE International Symposium on High Y.-P. Lin, C. H. Bennett, T. Cabaret, D. Vodenicarevic, D. Chabi, D. Querlioz, B.
Performance Computer Architecture (HPCA) (2017), pp. 541–552. Jousselme, V. Derycke, and J.-O. Klein, “Physical realization of a supervised learn-
175
See https://developer.nvidia.com/deep-learning-performance-training-inference for ing system built with organic memristive synapses,” Sci. Rep. 6, 31932 (2016).
196
“NVIDIA Data Center Deep Learning Product Performance;” accessed 13 May A. Fumarola, P. Narayanan, L. L. Sanches, S. Sidler, J. Jang, K. Moon, R. M.
2020. Shelby, H. Hwang, and G. W. Burr, “Accelerating machine learning with non-
176
See https://habana.ai/wp-content/uploads/2019/06/Goya-Datasheet-HL-10x.pdf volatile memory: Exploring device and circuit tradeoffs,” in IEEE
for “Habana Labs Goya HL-1000–Inference card;” accessed 13 May 2020. International Conference on Rebooting Computing (ICRC) (2016), pp. 1–8.
177 197
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog- I. Boybat, C. di Nolfo, S. Ambrogio, M. Bodini, N. C. P. Farinha, R. M. Shelby,
nition,” in IEEE Conference on Computer Vision and Pattern Recognition P. Narayanan, S. Sidler, H. Tsai, Y. Leblebici, and G. W. Burr, “Improved deep
(CVPR) (2016), pp. 770–778. neural network hardware-accelerators based on non-volatile-memory: The
178
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, local gains technique,” in IEEE International Conference on Rebooting
Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation sys- Computing (ICRC) (2017), pp. 1–8.
198
tem: Bridging the gap between human and machine translation,” T. Gokmen and W. Haensch, “Algorithm for training neural networks on
arXiv:1609.08144 (2016). resistive device arrays,” Front. Neurosci. 14, 103 (2020).
179 199
V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. R. B. Jacobs-Gedrim, S. Agarwal, R. S. Goeke, C. Smith, P. S. Finnegan, J.
Anderson, M. Breughe, M. Charlebois, W. Chou et al., “MLPerf inference Niroula, D. R. Hughart, P. G. Kotula, C. D. James, and M. J. Marinella,
benchmark,” arXiv:1911.02549 (2019). “Analog high resistance bilayer RRAM device for hardware acceleration of
180
P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian, neuromorphic computation,” J. Appl. Phys. 124, 202101 (2018).
200
“Fully hardware-implemented memristor convolutional neural network,” Y. Jeong, M. A. Zidan, and W. D. Lu, “Parasitic effect analysis in memristor-array-
Nature 577, 641–646 (2020). based neuromorphic systems,” IEEE Trans. Nanotechnol. 17, 184–193 (2018).
181 201
M. Collins, J. Hasler, and S. George, “An open-source tool set enabling analog- S. Agarwal, R. L. Schiek, and M. J. Marinella, “Compensating for parasitic
digital-software co-design,” J. Low Power Electron. Appl. 6, 3 (2016). voltage drops in resistive memory arrays,” in IEEE International Memory
182
M. A. Zidan, Y. Jeong, J. H. Shin, C. Du, Z. Zhang, and W. D. Lu, “Field-pro- Workshop (IMW) (2017), pp. 1–4.
202
grammable crossbar array (FPCA) for reconfigurable computing,” Z. He, J. Lin, R. Ewetz, J.-S. Yuan, and D. Fan, “Noise injection adaption:
arXiv:1612.02913 (2016). End-to-end ReRAM crossbar non-ideal effect adaption for neural network
183
I. Kataeva, F. Merrikh-Bayat, E. Zamanidoost, and D. Strukov, “Efficient train- mapping,” in Proceedings of the 56th Annual Design Automation Conference
ing algorithms for neural networks based on memristive crossbar circuits,” in (DAC ’19) (ACM, New York, 2019), pp. 57:1–57:6.
203
International Joint Conference on Neural Networks (IJCNN) (2015), pp. 1–8. A. Ankit, A. Sengupta, and K. Roy, “TraNNsformer: Neural network transfor-
184
S. Choi, J. H. Shin, J. Lee, P. Sheridan, and W. D. Lu, “Experimental demon- mation for memristive crossbar based neuromorphic system design,” in
stration of feature extraction and dimensionality reduction using memristor Proceedings of the 36th International Conference on Computer-Aided Design
networks,” Nano Lett. 17, 3113–3118 (2017). (ICCAD ’17) (IEEE Press, Piscataway, NJ, 2017), pp. 533–540.
185 204
T. D. Sanger, “Optimal unsupervised learning in a single-layer linear feedfor- J. Lin, Z. Zhu, Y. Wang, and Y. Xie, “Learning the sparsity for ReRAM:
ward neural network,” Neural Networks 2, 459–473 (1989). Mapping and pruning sparse neural network for ReRAM based accelerator,”
186
P. Narayanan, A. Fumarola, L. L. Sanches, K. Hosokawa, S. C. Lewis, R. M. in Proceedings of the 24th Asia and South Pacific Design Automation
Shelby, and G. W. Burr, “Toward on-chip acceleration of the backpropagation Conference (ASPDAC ’19) (Association for Computing Machinery, New York,
algorithm using nonvolatile memory,” IBM J. Res. Develop. 61, 11:1–11:11 2019), pp. 639–644.
205
(2017). H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for
187
L. Gao, I.-T. Wang, P.-Y. Chen, S. Vrudhula, J. sun Seo, Y. Cao, T.-H. Hou, efficient convnets,” arXiv:1608.08710 (2016).
206
and S. Yu, “Fully parallel write/read in resistive synaptic array for accelerating W. Huangfu, L. Xia, M. Cheng, X. Yin, T. Tang, B. Li, K. Chakrabarty, Y. Xie,
on-chip learning,” Nanotechnology 26, 455204 (2015). Y. Wang, and H. Yang, “Computation-oriented fault-tolerance schemes for
188
D. Kadetotad, Z. Xu, A. Mohanty, P. Chen, B. Lin, J. Ye, S. Vrudhula, S. Yu, Y. RRAM computing systems,” in 22nd Asia and South Pacific Design
Cao, and J. Seo, “Parallel architecture with resistive crosspoint array for dictio- Automation Conference (ASP-DAC) (2017), pp. 794–799.
207
nary learning acceleration,” IEEE J. Emerging Sel. Top. Circuits Syst. 5, I. Boybat, M. L. Gallo, S. Nandakumar, T. Moraitis, T. Parnell, T. Tuma, B.
194–204 (2015). Rajendran, Y. Leblebici, A. Sebastian, and E. Eleftheriou, “Neuromorphic
189
B. D. Hoskins, M. W. Daniels, S. Huang, A. Madhavan, G. C. Adam, N. computing with multi-memristive synapses,” Nat. Commun. 9, 2514 (2018).
208
Zhitenev, J. J. McClelland, and M. D. Stiles, “Streaming batch eigenupdates for S. Ambrogio, M. Gallot, K. Spoon, H. Tsai, C. Mackin, M. Wesson, S.
hardware neural networks,” Front. Neurosci. 13, 793 (2019). Kariyappa, P. Narayanan, C. Liu, A. Kumar, A. Chen, and G. W. Burr,
190
S. Agarwal, R. B. J. Gedrim, A. H. Hsia, D. R. Hughart, E. J. Fuller, A. A. “Reducing the impact of phase-change memory conductance drift on the
Talin, C. D. James, S. J. Plimpton, and M. J. Marinella, “Achieving ideal accu- inference of large-scale hardware neural networks,” in IEEE International
racies in analog neuromorphic computing using periodic carry,” in Electron Devices Meeting (IEDM) (2019), pp. 6.1.1–6.1.4.
209
Symposium on VLSI Technology (IEEE, 2017), pp. T174–T175. L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang, “Accelerator-
191
S. R. Nandakumar, M. L. Gallo, C. Piveteau, V. Joshi, G. Mariani, I. Boybat, G. friendly neural-network training: Learning variations and defects in RRAM
Karunaratne, R. Khaddam-Aljameh, U. Egger, A. Petropoulos, T. Antonakopoulos, crossbar,” in Design, Automation Test in Europe Conference Exhibition
B. Rajendran, A. Sebastian, and E. Eleftheriou, “Mixed-precision deep learning (DATE) (2017), pp. 19–24.
210
based on computational memory,” Front. Neurosci. 14, 406 (2020). L. Xia, M. Liu, X. Ning, K. Chakrabarty, and Y. Wang, “Fault-tolerant training
192
G. W. Burr, P. Narayanan, R. M. Shelby, S. Sidler, I. Boybat, C. di Nolfo, and with on-line fault detection for RRAM-based neural computing systems,” in 54th
Y. Leblebici, “Large-scale neural networks implemented with non-volatile ACM/EDAC/IEEE Design Automation Conference (DAC) (2017), pp. 1–6.
211
memory as the synaptic weight element: Comparative performance analysis V. Joshi, M. L. Gallo, I. Boybat, S. Haefeli, C. Piveteau, M. Dazzi, B. Rajendran,
(accuracy, speed, and power),” in IEEE International Electron Devices A. Sebastian, and E. Eleftheriou, “Accurate deep neural network inference
Meeting (IEDM) (2015), pp. 4.4.1–4.4.4. using computational phase-change memory,” arXiv:1906.03138 (2019).

Appl. Phys. Rev. 7, 031301 (2020); doi: 10.1063/1.5143815 7, 031301-33


C Author(s) 2020
V
Applied Physics Reviews REVIEW scitation.org/journal/are

212 216
C. H. Bennett, T. P. Xiao, R. Dellana, V. Agrawal, B. Feinberg, V. Prabhakar, D. Negrov, I. Karandashev, V. Shakirov, Y. Matveyev, W. Dunin-Barkowski,
K. Ramkumar, L. Hinh, S. Saha, V. Raghavan et al., “Device-aware inference and A. Zenkevich, “An approximate backpropagation learning rule for mem-
operations in SONOS nonvolatile memory arrays,” arXiv:2004.00802 (2020). ristor based neural networks using synaptic plasticity,” Neurocomputing 237,
213
M. Klachko, M. R. Mahmoodi, and D. B. Strukov, “Improving noise tolerance 193–199 (2017).
217
of mixed-signal neural networks,” arXiv:1904.01705 (2019). C. H. Bennett, D. Garland, R. B. Jacobs-Gedrim, S. Agarwal, and M. J.
214
C. H. Bennett, V. Parmar, L. E. Calvet, J. Klein, M. Suri, M. J. Marinella, and Marinella, “Wafer-scale TaOx device variability and implications for neuro-
D. Querlioz, “Contrasting advantages of learning with random weights and morphic computing applications,” in IEEE International Reliability Physics
backpropagation in non-volatile memory neural networks,” IEEE Access 7, Symposium (IRPS) (2019), pp. 1–4.
218
73938–73953 (2019). S. Agarwal, http://cross-sim.sandia.gov for “CrossSim;” accessed 7 December 2019.
215 219
T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, “Random synap- P. Chen, X. Peng, and S. Yu, “Neurosim: A circuit-level macro model for
tic feedback weights support error backpropagation for deep learning,” Nat. benchmarking neuro-inspired architectures in online learning,” IEEE Trans.
Commun. 7, 13276 (2016). Comput.-Aided Des. Integr. Circuits Syst. 37, 3067–3080 (2018).
