
FOCUS | Review Article
https://doi.org/10.1038/s41565-020-0655-z

Memory devices and applications for in-memory computing

Abu Sebastian ✉, Manuel Le Gallo, Riduan Khaddam-Aljameh and Evangelos Eleftheriou

Traditional von Neumann computing systems involve separate processing and memory units. However, data movement is costly in terms of time and energy, and this problem is aggravated by the recent explosive growth in highly data-centric applications related to artificial intelligence. This calls for a radical departure from the traditional systems, and one such non-von Neumann computational approach is in-memory computing, whereby certain computational tasks are performed in place in the memory itself by exploiting the physical attributes of the memory devices. Both charge-based and resistance-based memory devices are being explored for in-memory computing. In this Review, we provide a broad overview of the key computational primitives enabled by these memory devices as well as their applications spanning scientific computing, signal processing, optimization, machine learning, deep learning and stochastic computing.

Today's computing systems are primarily built based on the von Neumann architecture where data must be moved to a processing unit. During the execution of various computational tasks, large amounts of data need to be shuttled back and forth between the processing and memory units and this incurs significant costs in latency and energy. The latency associated with accessing data from the memory units is a key performance bottleneck for a range of applications, in particular for the increasingly prominent artificial intelligence (AI) related workloads. There is an increasing disparity between the speed of the memory and processing units, typically referred to as the memory wall1. The energy cost of moving data is another significant challenge given that the computing systems are severely power limited due to cooling constraints as well as the proliferation of mobile computing devices. Even at the relatively old 45 nm complementary metal oxide semiconductor (CMOS) node, the cost of multiplying two numbers is orders of magnitude lower than that of accessing them from memory2. The current approaches, such as the use of hundreds of processors in parallel (for example, graphics processing units3) or application-specific processors4,5 that are custom designed for specific applications, are not likely to fully overcome the challenge of data movement. Hence, it is becoming increasingly evident that novel architectures need to be explored where memory and processing are better collocated. One prominent idea that dates to the 1990s is that of physically placing monolithic compute units closer to a monolithic memory6. This concept known as near-memory computing has benefitted significantly from recent advances in die stacking technology7 and the commercialization of advanced memory modules such as the hybrid memory cube (HMC)8 and high bandwidth memory (HBM)9. To achieve a denser and more fine-grained connectivity between memory and processing units, even three-dimensional (3D) monolithic integration has been proposed10. However, in all of these approaches that aim to reduce the time and distance to memory access, there still exists a physical separation between the memory and the compute units.

In-memory computing is an alternate approach where certain computational tasks are performed in place in the memory itself organized as a computational memory unit. As schematically illustrated in Fig. 1, this is achieved by exploiting in tandem the physical attributes of the memory devices, their array-level organization, the peripheral circuitry as well as the control logic. Any computational task that is realized within the confines of a computational memory unit could be referred to as in-memory computing. However, the key distinction is that at no point during computation is the memory content read back and processed at the granularity of a single memory element. This latter scenario, where in addition the processing is performed in close proximity to the memory array, could instead be viewed as near-memory computing. Besides alleviating the costs in latency and energy associated with data movement, in-memory computing also has the potential to significantly improve the computational time complexity associated with certain computational tasks. This arises mostly from the massive parallelism afforded by a dense array of millions of memory devices performing computation. It is also likely that by introducing physical coupling between the memory devices, we can further reduce the computational time complexity11. By blurring the boundary between processing and memory units (an attribute that is also shared with the highly energy-efficient mammalian brain where memory and processing are deeply intertwined12), we gain significant improvements in computational efficiency. However, this is at the expense of the generality afforded by the conventional approach where memory and processing units are functionally distinct from each other. In this Review, we first give an overview of the memory devices that facilitate in-memory computing as well as the key in-memory computational primitives that are enabled. Subsequently, we present a range of applications that exploit these primitives. Finally, we present an outlook on the opportunities and challenges.

Memory devices
Memory is at the heart of in-memory computing. One of the primary means to store information to date is through the presence or absence of charge such as in dynamic random access memory (DRAM), static random access memory (SRAM) and flash memory13. There is also an emerging class of memory devices where information is stored in terms of differences in the atomic arrangements or orientation of ferromagnetic metal layers. Such differences manifest as a change of resistance and these devices are thus termed resistive memory devices14. Sometimes they are also referred to as memristive devices due to their relation to the circuit theoretic concept of memristive systems15.

One of the primary characteristics of a memory device is the access time, that is, how fast information can be stored (written) and retrieved (read).

IBM Research – Zurich, Rüschlikon, Switzerland. ✉e-mail: [email protected]



Fig. 1 | In-memory computing. a, In a conventional computing system, when an operation f is performed on data D, D has to be moved into a processing
unit, leading to significant costs in latency and energy. b, In the case of in-memory computing, f(D) is performed within a computational memory unit by
exploiting the physical attributes of the memory devices, thus obviating the need to move D to the processing unit. The computational tasks are performed
within the confines of the memory array and its peripheral circuitry, albeit without deciphering the content of the individual memory elements. Both
charge-based memory technologies, such as SRAM, DRAM and flash memory, and resistance-based memory technologies, such as RRAM, PCM and
STT-MRAM, can serve as elements of such a computational memory unit.

Another key characteristic is cycling endurance, which refers to the number of times a memory device can be switched from one state to the other. The memory devices in a computational memory unit are usually organized in a two-dimensional (2D) array with horizontal and vertical wires, typically referred to as the word line (WL) and the bit line (BL), used to access them. The memory array in a computational memory unit can be quite similar to that in a conventional memory unit but with certain differences in the read/write circuitry, the format of the input/output data as well as the control logic. For example, depending on the applications, multiple WLs need to be activated in parallel or analogue output currents along BLs need to be sensed precisely.

Charge-based memory. An SRAM cell is a bi-stable transistor structure typically made of two CMOS inverters connected back to back, as shown in Fig. 2a. The output potential of one inverter is applied as input to the other, forming a feedback loop that freezes the cell in a given logical state (0 or 1). Two additional field-effect transistors (FETs) serve as selectors, yielding a standard 6 transistor (6T) SRAM cell. SRAM is built entirely from FETs and has no dedicated storage element. However, one can view the charge as being confined within the barriers formed by the FET channels and the gate insulators. Due to the low FET barrier height (0.5 eV), however, the charge constantly needs to be replenished from an external source and hence SRAM always needs to be connected to a power supply. A DRAM cell consists of a capacitor placed in series with a FET (Fig. 2b). The charge is confined within the capacitor insulator, which forms a fixed-height barrier, and the FET. Since the maximum height of the FET barrier is limited by the band-gap of silicon (≈1.1 eV), the charge can be retained only for a fraction of a second and this necessitates periodic refresh. As shown in Fig. 2c, in a Flash memory cell, the charge storage node is coupled to the gate of a FET with charge stored either on a conductive electrode surrounded by insulators (floating gate) or in discrete traps within a defective insulator layer (charge trapping layer). Unlike in DRAM, the barrier height of the storage node is sufficiently high for long-term data retention. However, the write operation requires high voltages (typically >10 V) and entails significant latency (>10 µs) due to the need to overcome the storage node barriers. Depending on how the flash memory cells are organized, they are referred to as NOR or NAND Flash. In NOR Flash, every memory cell is connected to a BL, while in NAND Flash, several memory cells connected in series share a single connection to the BL. A flash memory cell stores fewer electrons than DRAM and SRAM. Flash memory also has a substantially lower cycling endurance due to the gate oxide degradation under strong electric fields.

A range of in-memory logic and arithmetic operations can be performed using both SRAM and DRAM. Capacitive charge redistribution serves as the foundation for many of them, in particular storing and sharing of charge across multiple storage nodes. In DRAM, simultaneous reading of devices along multiple rows can be used to execute basic Boolean functions within the memory array16,17.

[Figure 2 panels (schematics): a, 6T SRAM cell; b, DRAM cell; c, flash cell; d, DRAM-based AND/OR logic; e, SRAM-based AND/NOR logic; f, SRAM-plus-capacitor MVM; g, flash-based MVM. Panel annotations: SRAM ~10^3 electrons, access time <1 ns, endurance >10^16; DRAM ~10^5 electrons, access time <10 ns, endurance >10^16; flash ~100 electrons, read access time <100 ns, endurance >10^5.]
Fig. 2 | Charge-based memory devices and computational primitives. a, A 6T SRAM cell consists of two CMOS inverters connected back to back. The charge is confined within the barriers formed by FET channels and by gate insulators. The stored charge retention is small and an external source constantly replenishes the lost charge. SRAM has almost unlimited cycling endurance and sub-nanosecond read and write access times. b, A DRAM cell comprises a capacitor (C) that serves as the storage node, which is connected in series to a FET. c, The storage node of a flash memory cell is coupled to the gate of a FET. d, Schematic illustration of bit-wise logical operations performed using three DRAM cells. The operands are stored in cells A and B. AND or OR operations are performed by simultaneously activating the three WLs corresponding to the cells. The logical state of cell SEL is used to dictate whether an AND or an OR operation is performed, with logical one and zero corresponding to OR and AND operations, respectively. The BL voltage corresponds to the average voltage across the three capacitors and is sensed using a sense amplifier with a decision threshold voltage of VREF. e, Bit-wise logical operations using an SRAM array. The BL and B̄L are pre-charged to the supply voltage, VDD, prior to the execution of the operation. After deactivation of the pre-charge signal, both the WLs are activated so that both BL and B̄L are discharged at different rates that depend on the data stored in the bit-cells. When the two activated SRAM cells in a column are both 1 (0), VBL (VB̄L) will be comparable to VDD, whereas for the other bit combinations, both VBL and VB̄L will be lower than VDD. Hence, by sensing VBL and VB̄L with a SA, AND and NOR operations are performed, respectively. f, Schematic illustration of performing an MVM operation using an array of SRAM cells and capacitors. The SRAM cells are used to store the elements of the binary matrix. In the first step, the inputs are provided per row that charges the capacitors on that row to a value proportional to the input. In step two, the capacitors that are associated with the SRAM elements storing 0s are discharged. Finally, in step three, the capacitors are shorted along the columns performing a charge sharing operation so that the final voltage on the capacitors corresponds to the analogue MVM result. g, Illustration of an MVM operation performed using Flash memory devices. The current IDS is a function of the cell's threshold voltage Vt as well as the drain–source voltage VDS and the gate–source voltage VGS. By fixing VDS, Kirchhoff's current law can be employed to perform MVM between a matrix, stored in terms of Vt, and a binary input vector that is used to modulate VGS.
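The charge-sharing scheme of Fig. 2d can be captured in a few lines of simulation. The sketch below is an idealized model (equal capacitors, no leakage or variability; all names are illustrative), showing how averaging the three cell voltages and thresholding at VDD/2 yields AND or OR depending on the state of the SEL cell.

```python
VDD = 1.0          # bit-line voltage representing logic 1
VREF = VDD / 2.0   # sense-amplifier decision threshold

def dram_charge_sharing_logic(a: int, b: int, sel: int) -> int:
    """Idealized model of the triple-row activation scheme of Fig. 2d.

    Cells A, B and SEL are treated as equal, ideal capacitors. Activating all
    three word lines shares their charge, so the bit-line voltage settles at
    the average of the three cell voltages. With sel = 1 the sense amplifier
    outputs A OR B; with sel = 0 it outputs A AND B (a majority function).
    """
    v_bl = VDD * (a + b + sel) / 3.0   # charge sharing -> average voltage
    return int(v_bl > VREF)            # sense amplifier with threshold VREF

# Exhaustive check against the expected truth tables.
for a in (0, 1):
    for b in (0, 1):
        assert dram_charge_sharing_logic(a, b, sel=0) == (a & b)   # AND
        assert dram_charge_sharing_logic(a, b, sel=1) == (a | b)   # OR
```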


Figure 2d shows a basic cell configuration that can be used to implement bit-wise AND/OR functions. Two memory cells, A and B, are used to store the operands. The logic state of the third cell, SEL, is set to 0 or 1 depending on whether an AND or an OR operation is realized, respectively. When all three cells are activated simultaneously, the bit-line voltage corresponds to the average voltage across the three capacitors. This voltage is sensed using a sense amplifier (SA) with a single decision threshold, which outputs the result of the logical operation. By using the negated output of the SA to also implement the NOT operation, a functionally complete set of Boolean functions is obtained. These bit-wise operations can be performed along the entire row of memory devices thus enabling parallel bulk bit-wise operations. Unlike DRAM, the SRAM cells do not contain a built-in capacitor and hence the parasitic BL capacitance is used instead to enable bulk in-memory logical operations18,19. In Fig. 2e, a basic construct for performing in-place bit-wise logical operations using SRAM is shown. Here, again, both of the WLs are activated simultaneously and by sensing the BL and B̄L with an SA, AND and NOR operations are performed, respectively. Besides realizing the logical primitives, it is also essential to efficiently cascade such operations. To perform cascadable logic operations using both DRAM and SRAM, additional cloning or duplication steps need to be enabled, allowing the construction of in-memory full adders and multipliers17,20. The overhead of having to serially execute the cascaded operations is overcome by the ability to process several bit lines in parallel.

SRAM arrays can also be used for matrix-vector multiplication (MVM) operations, Ax = b, where A is the data matrix, x is the input vector, and b is the output vector21–23. If the elements of A and x are limited to signed binary values, the multiply operation is simplified to a combination of XNOR and ADD functions. Here, a 12T SRAM cell can be designed to execute XNOR operations within every memory cell21. In cases where x is non-binary, one approach is to employ capacitors in addition to the SRAM cells22–24. It was recently shown how 6-bit inputs can be multiplied with binary matrices stored in SRAM22. This involves a three-step process that is illustrated in Fig. 2f. Note that the additional capacitors and switches could be shared among a group of SRAM cells at the expense of reduced parallelism and hence operational bandwidth. It is also possible to build the analogue capacitor-based circuits in the vicinity of the SRAM array to accelerate MVM via near-memory computing25,26.

Flash memory can also be used to perform MVM operations27,28. The gate voltage is modulated in accordance with a binary input vector (see Fig. 2g). The matrix elements are stored as charge on the floating gate28. Because the devices can be accessed in parallel along a BL, NOR Flash has generally been preferred over NAND Flash for in-memory computing. However, there is recent work describing the use of 3D NAND, consisting of vertically stacked layers of serially connected Flash devices, whereby each layer of the array encodes a unique matrix29. This approach could help to overcome the scalability issue of NOR Flash, which is difficult to scale beyond the 28 nm technology node.

Resistance-based memory. Memristive devices can be programmed to be in a low resistance state (LRS) or a high resistance state (HRS) through the application of electrical SET and RESET pulses, respectively. There is also the possibility to achieve intermediate resistance levels in certain types of memristive devices. The devices are typically organized in a 2D array and require a selection device in series with each device to prevent parasitic sneak path currents during writing and reading30.

Resistive random access memory (RRAM) devices comprise metal–insulator–metal (MIM) stacks (Fig. 3a) and the resistive switching process typically involves the creation and disruption of conductive filaments (CF) comprising a localized concentration of defects. An LRS state corresponds to CFs bridging the two metal layers. Even though the history of RRAM can be traced back to at least the 1960s31, key technological demonstrations in the 2000s32–34 gave significant impetus to this technology. Phase change memory (PCM), which also dates back to the 1960s35, is based on the property of certain types of materials, such as Ge2Sb2Te5, to undergo a Joule heating-induced, rapid and reversible transition from a highly resistive amorphous phase to a highly conductive crystalline phase36,37. As shown in Fig. 3b, a typical PCM device has a mushroom shape where the bottom electrode confines heat and current. This results in a near-hemispherical shape of the amorphous region in the HRS state. By crystallizing the amorphous region, the LRS state is obtained. A relative newcomer to the resistive memory family, magnetoresistive random access memory (MRAM) consists of a magnetic tunnel junction (MTJ) structure with two ferromagnetic metal layers (pinned and free). These layers, for example made of the CoFeB alloy, are separated by a thin tunnel oxide such as MgO (Fig. 3c). In the pinned layer, the magnetic polarization is structurally fixed to act as a reference, whereas in the free layer it is free to change during the write operation. Voltage pulses of opposite polarity are applied to switch the polarization of the free layer. Depending on whether the two ferromagnetic polarizations are parallel or antiparallel, the LRS and HRS states are obtained due to the tunnel magnetoresistive effect. Spin transfer torque MRAM (STT-MRAM) is currently the most promising MRAM technology38,39. RRAM and PCM operate based on the rearrangement of atomic configurations and hence have worse access times (write speed) and cycling endurance than MRAM. However, they have substantially larger resistance windows that enable the storage of intermediate resistances even at an array level. RRAM has the advantage of using materials that are common in semiconductor manufacturing. However, in spite of the simplicity of the device concept, a comprehensive understanding of the switching mechanism is still lacking compared to PCM and MRAM.

One of the attributes of memristive devices that can be exploited for computation is their non-volatile binary storage capability. Logical operations are enabled through the interaction between the voltage and resistance state variables40. One particularly interesting characteristic of certain memristive logic families is statefulness, where the Boolean variable is represented solely in terms of the resistance states41–43. A schematic illustration of one such stateful memristive logic, MAGIC, that realizes the NOR logic operation is shown in Fig. 3d44. Both the operands and the result are stored in terms of the resistance state variable. Stateful logic can be realized almost entirely in the memory array and has been demonstrated for RRAM41 and STT-MRAM45. Stateful logic is also cascadable, whereby the output from one logical gate can directly feed into the input of a second logic gate46,47. However, in stateful logic, the devices repeatedly get written into during the execution of the logical operations, which is a key drawback due to the associated energy cost and the limited cycling endurance of the devices. Hence, there is renewed interest in non-stateful logic such as the one shown in Fig. 3e. Here, the logical operands are stored as resistance values, but the result of the logical operation is computed as a voltage signal48,49. The operands stay fixed in the memory array and the devices need not be programmed during the evaluation of the logical operation. However, the sequential cascading of these logical operations requires additional circuits, typically located outside of the memory array. Memristive threshold logic is yet another non-stateful logic family where both the inputs and outputs are voltage signals and the logical functions are defined using the resistance values50.

The non-volatile storage capability, in particular, the ability to store a continuum of conductance values, facilitates the key computational primitive of analogue MVM51–53. The physical laws that are exploited to perform this operation are Ohm's law and Kirchhoff's current summation laws (Fig. 3f).
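As a software illustration of the signed-binary SRAM MVM described above (a sketch under the assumption that matrix and input elements take values in {-1, +1}; it is not a description of any particular 12T bit-cell), the multiply reduces to an XNOR on the bit encodings followed by a population count:

```python
import numpy as np

def xnor_popcount_mvm(A_sign: np.ndarray, x_sign: np.ndarray) -> np.ndarray:
    """Signed-binary MVM A x with A, x in {-1, +1}.

    Encoding +1 -> 1 and -1 -> 0, the product of two signed values equals the
    XNOR of their encodings; each dot product then becomes
    2 * popcount(XNOR) - n, which is what an XNOR bit-cell array followed by
    an adder tree (or analogue accumulation) would compute.
    """
    A_bits = (A_sign > 0).astype(np.int64)
    x_bits = (x_sign > 0).astype(np.int64)
    xnor = 1 - np.bitwise_xor(A_bits, x_bits[np.newaxis, :])   # per-cell XNOR
    popcount = xnor.sum(axis=1)                                # per-row ADD
    return 2 * popcount - A_sign.shape[1]

rng = np.random.default_rng(0)
A = rng.choice([-1, 1], size=(4, 16))
x = rng.choice([-1, 1], size=16)
assert np.array_equal(xnor_popcount_mvm(A, x), A @ x)
```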

[Figure 3 panels (schematics): a, RRAM; b, PCM; c, STT-MRAM; d, stateful logic; e, non-stateful logic; f, matrix-vector multiplication. Panel annotations: RRAM resistance range 10^3–10^7, write access time 10–100 ns, endurance 10^6–10^9; PCM resistance range 10^4–10^7, write access time ~100 ns, endurance 10^6–10^9; STT-MRAM resistance range 10^3–10^4, write access time <10 ns, endurance >10^14.]

Fig. 3 | Resistance-based memory devices and computational primitives. a, An RRAM device in the LRS where the CF comprises a large concentration of defects, for example oxygen vacancies in metal oxides or metallic ions injected from the electrodes. By the application of appropriate voltage pulses, the defects can be migrated back to the top electrode thus disconnecting the CF and achieving a HRS. b, A mushroom-type PCM device in the HRS state where the amorphous phase blocks the bottom electrode. To create this state, a RESET pulse is applied that can melt a significant portion of the phase change material. When the pulse is stopped abruptly, the molten material quenches into the amorphous phase due to glass transition. When a current pulse of lesser amplitude is applied to the PCM device in the HRS state, a part of the amorphous region crystallizes. By fully crystallizing the phase change material, the LRS state is obtained. c, An STT-MRAM device with two ferromagnetic layers (pinned and free) separated by a tunnel oxide layer. The magnetic polarization of the free layer can be changed upon writing. Depending on whether the ferromagnetic polarizations are parallel or antiparallel, the device assumes a low or high resistance, respectively. The transition to the parallel state takes place directly through conduction electrons, which are previously spin-polarized by the pinned layer. Subsequently, the magnetic polarization of the free layer is rotated using magnetic momentum conservation. To switch to the antiparallel state, an opposite voltage, and hence current direction, is employed. d, Schematic illustration of a stateful NOR logic operation using three bipolar memristive devices44. Two devices represent the operands and one represents the result. First, the result device is initialized to logic 1 (LRS). Subsequently, a voltage pulse with an amplitude larger than twice that of VRESET is applied simultaneously to both the operand devices. If either operand device is at logic 1 (LRS), then at least half of the voltage drops across the result device and the latter switches to logic 0 (HRS). Note that, due to the bipolar switching behaviour, the operand devices remain unchanged as long as VSET < 2VRESET. When both the operand devices are at logic 0 (HRS), the voltage dropped across the result device is not sufficient to switch it to logic 0; hence it remains at logic 1. Thus, this simple circuit implements a NOR operation where all the logic state variables are represented purely in terms of resistance values. e, Non-stateful AND and OR operations using two memristive devices and a variable-threshold SA. By simultaneously activating multiple rows, and with the appropriate choice of current thresholds, it is possible to implement logical operations such as AND and OR. f, To perform the operation Ax = b, the elements of A are mapped linearly to the conductance values of memristive devices organized in a crossbar configuration. The x values are mapped linearly to the amplitudes or durations of read voltages and are applied to the crossbar along the rows. The result of the computation, b, will be proportional to the resulting current measured along the columns of the array. Note that, if the inputs are mapped onto durations, the result b will be proportional to the total charge (for example, current integrated over a certain fixed period of time). It is also possible to perform an MVM operation with the transpose of A using the same crossbar configuration by applying the input voltage to the column lines and measuring the resulting current along the rows. The negative elements of x are typically applied as negative voltages whereas the negative elements of A are coded on separate devices together with a subtraction circuit.
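The mapping of Fig. 3f can be mimicked numerically. The following sketch (illustrative only; the conductance range, read voltage, noise level and the differential mapping for negative matrix elements are assumptions) programs a signed matrix onto two conductance arrays and computes b ≈ Ax as per-line current sums, that is, Ohm's law per device followed by Kirchhoff's current law per output line:

```python
import numpy as np

rng = np.random.default_rng(1)
G_MIN, G_MAX = 1e-6, 1e-4   # assumed programmable conductance range (siemens)

def program_differential(A: np.ndarray):
    """Map a signed matrix A onto a pair of conductance arrays (Gp, Gn)."""
    scale = (G_MAX - G_MIN) / np.max(np.abs(A))
    Gp = G_MIN + scale * np.clip(A, 0, None)    # positive elements
    Gn = G_MIN + scale * np.clip(-A, 0, None)   # negative elements
    return Gp, Gn, scale

def analogue_mvm(Gp, Gn, scale, x, v_read=0.2, noise=0.02):
    """b ≈ A x. Each device contributes G*V (Ohm's law); each output line sums
    the device currents (Kirchhoff's current law); a subtraction circuit
    combines the positive and negative arrays."""
    V = v_read * x   # inputs encoded as (possibly negative) read voltages
    I_p = (Gp * (1 + noise * rng.standard_normal(Gp.shape))) @ V
    I_n = (Gn * (1 + noise * rng.standard_normal(Gn.shape))) @ V
    return (I_p - I_n) / (v_read * scale)

A = rng.standard_normal((8, 8))
x = rng.standard_normal(8)
Gp, Gn, s = program_differential(A)
print("max error:", np.max(np.abs(analogue_mvm(Gp, Gn, s, x) - A @ x)))
```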

Memristive devices also exhibit an accumulative behaviour52,54,55, whereby the conductance of devices such as PCM and RRAM progressively increases or decreases with the successive application of appropriate programming pulses. This non-volatile accumulative behaviour, in spite of its nonlinear and stochastic nature, can be exploited in several applications, such as training deep neural networks, where the conductance values need to be incrementally modified.

Applications
The computational primitives reviewed in the Memory devices section have been applied to a wide range of application domains, ranging from scientific computing that requires high precision, to stochastic computing that is enabled by imprecision and randomness. A high-level overview of the main applications that are being researched for in-memory computing is shown in Fig. 4. In-memory computing can be applied both to reduce the computational complexity of a problem as well as to reduce the amount of data being accessed by performing computations inside the memory arrays. The problems that could benefit the most from the complexity reduction are the NP-hard problems involving combinatorial optimization. Data-centric applications in machine learning and scientific computing benefit the most from reduced memory access. In this section, we review how in-memory computing has been applied to those applications and discuss the challenges involved with respect to the device properties presented previously.


[Figure 4 content: applications (random number generation, unclonable functions, combinatorial optimization, reservoir computing, spiking neural networks, deep learning training and inference, compressed sensing, sparse coding, principal component analysis, associative memory, image filtering and compression, and solving linear and partial differential equations) arranged along a computational-precision axis running from stochastic computing and security, through signal processing, optimization and machine learning and deep learning, to scientific computing, with qualitative indications of computational complexity and data accesses.]

Fig. 4 | The application landscape for in-memory computing. The applications are grouped into three main categories based on the overall degree of
computational precision that is required. A qualitative measure of the computational complexity and data accesses involved in the different applications
is also shown.

Scientific computing. Linear algebra computational kernels, such as MVM, are common not only to machine learning but also to scientific computing applications. However, both memristive and charge-based memory devices suffer from significant inter-device variability and inhomogeneity across an array. Moreover, they exhibit intra-device variability and randomness that is intrinsic to how they operate. Hence, the precision of analogue MVM operations with these devices is rather low. Although approximate solutions are sufficient for many computational tasks in the domain of AI, building an in-memory computing unit that can effectively address scientific computing and data analytics problems—which typically require high numerical accuracy—remains challenging.

The aforementioned accuracy limitation can, to a certain extent, be remedied by an old technique in computer architecture called 'bit slicing'. Bit slicing is a general approach for constructing a processor from modules of smaller bit width. Each of the modules processes one bit field or 'slice' of an operand56. The grouped processing components will then have the capability to process, in parallel, an arbitrarily chosen full word-length of a particular task. This concept has been proposed for increasing the accuracy of the in-place MVM based on in-memory computing (Fig. 5a)57–60. According to this technique, an n-bit element of the matrix is mapped onto device conductance values of n binary crossbar arrays, that is, n bit slices. Thus, each bit slice contains the binary values of the matrix elements in a particular bit position. Similarly, bit slicing can also be applied to the input vector elements, where each bit slice is input to the crossbar arrays one at a time. To perform an in-place MVM, a vector bit slice is multiplied with a matrix bit slice, with O(1) time complexity, and the partial products of these operations are combined outside of the crossbar arrays through a shift-and-add reduction network57. Note that the bit slices can also be implemented on the same crossbar array in a column-by-column manner. In this case, columns at a distance n from each other represent a single bit slice. Although the above concept has been described based on bit slices, that is, binary memristive arrays, it can easily be generalized to multi-level memristive devices. The bit slice approach applied to a 16-bit input vector sliced into 16 1-bit slices for increasing numerical precision has been demonstrated experimentally where a numerical differential equation solver using a small Ta2O5–x RRAM 16×3 crossbar array was successfully implemented61.

Although the bit slice technique appears to address the limitations surrounding the precision of analogue MVM operations, there are still inaccuracies arising from the analogue summation along columns, which potentially could be more detrimental in larger crossbar arrays. Moreover, the extra peripheral circuitry of the shift-and-add external reduction networks could substantially increase the energy consumption and area. Mixed-precision computing is an alternate approach to achieve high precision processing based on in-memory computing. This approach is based on the well-established iterative refinement technique for improving a computed solution to a system of linear equations62. Through this technique, the time complexity of iterative linear solvers can be reduced by combining low-precision with high-precision arithmetic63. The adaptation of this concept for in-memory computing and experimental demonstration of solving a system of 5,000 linear equations using 998,752 PCM devices with arbitrarily high accuracy was presented in ref. 64. Here, the idea is to use fast but imprecise MVM, via in-memory computing in an iterative linear solver, to obtain an approximate solution, and then refine this solution based on the residual error calculated precisely through digital computing (Fig. 5b). The main limitation of this technique is that the data need to be stored both in crossbar arrays as well as in the memory of a high-precision digital processing unit, which increases the resources needed to solve the problem. Moreover, the achievable speedup comes from reducing the number of iterations needed to solve the problem, resulting in an overall computational complexity of O(N²) for an N×N matrix, that is, still proportional to the problem size.

Several extensions to these two techniques are imaginable to further improve the performance benefits and reliability.
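A minimal software rendition of the bit-slicing scheme (assuming unsigned integer operands and ideal binary crossbars; all names are illustrative) shows how 1-bit by 1-bit partial MVMs are recombined with a shift-and-add reduction to recover the exact integer result:

```python
import numpy as np

def bit_slice(values: np.ndarray, n_bits: int) -> np.ndarray:
    """Return binary slices, least significant bit first: shape (n_bits, ...)."""
    return np.stack([(values >> k) & 1 for k in range(n_bits)])

def bit_sliced_mvm(A: np.ndarray, x: np.ndarray, a_bits=4, x_bits=4) -> np.ndarray:
    """Exact integer MVM reassembled from 1-bit x 1-bit partial products.

    Each matrix bit slice plays the role of one binary crossbar; each input
    bit slice is applied to all crossbars in turn, and the partial results are
    combined with a shift-and-add reduction, as in the schemes discussed above.
    """
    A_slices = bit_slice(A, a_bits)        # (a_bits, m, n) binary matrices
    x_slices = bit_slice(x, x_bits)        # (x_bits, n) binary vectors
    result = np.zeros(A.shape[0], dtype=np.int64)
    for i, A_i in enumerate(A_slices):
        for j, x_j in enumerate(x_slices):
            partial = A_i @ x_j            # binary MVM: one crossbar pass
            result += partial << (i + j)   # shift-and-add reduction network
    return result

rng = np.random.default_rng(2)
A = rng.integers(0, 16, size=(3, 5))
x = rng.integers(0, 16, size=5)
assert np.array_equal(bit_sliced_mvm(A, x), A @ x)
```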

[Figure 5 panels (schematics): a, bit-sliced inner product in a crossbar array, illustrated with the example [0 1 3 2]·[3 6 2 1] = 14 and a shift-and-add reduction after analogue-to-digital conversion; b, a low-precision computational memory unit (fast, imprecise MVM) coupled through a digital interface to a high-precision digital processing unit (iterative refinement).]
Fig. 5 | Increasing the precision of in-memory computing for scientific computing. a, Implementation of the bit slicing concept in a crossbar array for an
inner product operation. The 3-bit data vector is sliced into three 1-bit vectors stored on three separated columns of the crossbar array. The 2-bit input
vector is sliced into two 1-bit vectors sequentially applied to the crossbar array as voltages. The outputs of the crossbar from the first input bit slice go
through an analogue to digital conversion and appropriate shifting prior to accumulation and storage in a local buffer as a partial inner product result. The
second input bit slice undergoes the same process, producing the second partial inner product result. These two partial inner product results are added up,
yielding the final result of the in-place inner-product vector operation. b, The concept of mixed-precision in-memory computing used to iteratively improve
the computed solution to a system of linear equations based on inaccurate MVM operations performed via analogue in-memory computing.
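The mixed-precision flow of Fig. 5b can be sketched as follows (a toy example; the multiplicative noise model standing in for the computational memory unit, the test matrix, tolerances and iteration counts are all assumptions). An inexact inner solver that only touches the matrix through noisy MVMs produces a correction, and a residual computed digitally in full precision drives the outer refinement loop to high accuracy.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
M = rng.standard_normal((N, N))
A = M @ M.T / N + np.eye(N)              # well-conditioned SPD test matrix
b = rng.standard_normal(N)

def noisy_mvm(A, v, noise=0.02):
    """Stand-in for the low-precision in-memory MVM (multiplicative noise on A)."""
    return (A * (1.0 + noise * rng.standard_normal(A.shape))) @ v

def inner_solve(A, r, iters=30):
    """Approximately solve A z = r with Richardson iterations that access A
    only through the imprecise MVM (the 'computational memory' part)."""
    omega = 1.0 / np.linalg.norm(A, 2)   # step size ensuring contraction
    z = np.zeros_like(r)
    for _ in range(iters):
        z = z + omega * (r - noisy_mvm(A, z))
    return z

# Outer loop: iterative refinement. The residual is computed precisely by the
# digital processing unit; the correction comes from the noisy memory unit.
x = np.zeros(N)
for k in range(30):
    r = b - A @ x
    if np.linalg.norm(r) < 1e-9 * np.linalg.norm(b):
        break
    x = x + inner_solve(A, r)
print(f"stopped after {k} refinements, relative residual "
      f"{np.linalg.norm(A @ x - b) / np.linalg.norm(b):.2e}")
```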

One way to potentially speed up linear solvers further is to realize a one-step linear solver in the analogue domain65, which has been demonstrated using a 3×3 RRAM crossbar array66. This approach is based on an old idea of analogue matrix inversion67, whereby a known vector, forced as currents on the columns of the crossbar, establishes an output voltage vector at the rows, which is equal to the product of the inverse of the conductance matrix multiplied by the vector of currents. Although the high parallelism provided by this approach is promising, its implementation is hardwired and therefore not scalable, and requires very precise conductance tuning and high linearity of current–voltage characteristics. There are also initial results on error correction schemes68 as well as extensions to the bit-slicing concept for achieving floating-point accuracy69 on memristive crossbar arrays. These research avenues could enlarge the application space of in-memory computing to encompass applications in scientific computing where high computational accuracy is required.

Signal processing, optimization and machine learning. There are several applications in the domain of signal processing, optimization and machine learning where approximate solutions can be considered acceptable, and the bulk of the computation could thus be performed with in-memory computing. The crossbar-based analogue MVM can be used in many applications such as image compression, compressed sensing, combinatorial optimization, sparse coding, principal component analysis, associative memories and reservoir computing.

The application of in-memory computing to analogue image compression has been studied experimentally in ref. 70. The idea is to encode a transform matrix, for example, a discrete cosine transform, as the conductance values of devices organized in a crossbar array. The image pixel intensities, represented as voltages, are applied to the crossbar first row by row and, in a second step, column by column. The compression is then performed by keeping only a certain ratio of the highest coefficients of the transformed image and discarding the rest. Compression experiments using a 128×64 crossbar array of hafnium oxide (HfO2) devices yielded reasonably well-reconstructed images, although with a few visible artefacts due to device non-idealities70. The transform coding described above for sparsifying large signals is fundamental to common compression schemes such as JPEG or MPEG, but can also be used for compressed sensing. The basic idea of compressed sensing is to acquire a few (M) sampling measurements from a high-dimensional signal of size N, and to subsequently recover that signal accurately. Compressed sensing can be realized via in-memory computing by encoding the M×N measurement matrix used for this process, which typically contains randomly distributed elements, in a crossbar array of memory devices65,71. This array can be used to perform the MVM operations associated with both the compression and recovery tasks. The efficacy of this scheme has been experimentally demonstrated through 128×128 image compression and reconstruction tasks using more than 256,000 PCM devices71. However, here as well, device non-idealities such as conductance noise were found to reduce the reconstruction accuracy.
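The transform-coding compression described above can be emulated end to end. In the sketch below (assumptions: an orthonormal DCT as the stored transform, a simple multiplicative noise model for the crossbar, a smooth synthetic image and a 10% coefficient-retention ratio), two crossbar passes implement the 2D transform, thresholding keeps the largest coefficients and the inverse transform reconstructs the image:

```python
import numpy as np

rng = np.random.default_rng(4)

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II transform matrix."""
    k = np.arange(n)
    D = np.cos(np.pi * (2 * k[np.newaxis, :] + 1) * k[:, np.newaxis] / (2 * n))
    D[0, :] *= 1 / np.sqrt(2)
    return D * np.sqrt(2.0 / n)

def crossbar_mvm(G, V, noise=0.01):
    """Analogue matrix multiply: stored matrix perturbed by programming noise."""
    return (G * (1 + noise * rng.standard_normal(G.shape))) @ V

n = 64
image = np.outer(np.sin(np.linspace(0, 3, n)), np.cos(np.linspace(0, 5, n)))
D = dct_matrix(n)

# Two passes through the 'crossbar': first along rows, then along columns
coeffs = crossbar_mvm(D, crossbar_mvm(D, image).T).T

# Keep only the largest 10% of coefficients (compression), discard the rest
keep = np.abs(coeffs) >= np.quantile(np.abs(coeffs), 0.9)
reconstructed = D.T @ (coeffs * keep) @ D

print("relative reconstruction error:",
      np.linalg.norm(reconstructed - image) / np.linalg.norm(image))
```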


In the field of optimization, a promising application of in-memory computing is for combinatorial optimization problems, such as the travelling salesman problem, Boolean satisfiability and integer linear programming. Combinatorial optimization is the process of searching for maxima or minima of an objective function whose domain is a discrete but large configuration space. To address these computationally intensive, typically NP-hard problems, simulated annealing inspired approaches, such as the massively parallel Boltzmann machines and Hopfield networks, have been proposed. The basic idea is to compute the inner products, the fundamental building blocks in Boltzmann machines57 or Hopfield networks72, in place via in-memory computing. For solving the problem, the network is run until convergence, that is, the energy is minimized, which involves updating only the state variables, while the weights implemented in the crossbar array remain constant. An interesting prospect is to utilize the device noise as an explicit source of noise to force the network to continuously explore the solution space, which is necessary to achieve proper convergence72,73. However, it is required to precisely control this noise via an annealing schedule, which is challenging to implement. Another intriguing approach, going beyond simply accelerating the inner products in recurrent networks, is to use a network of coupled nonlinear analogue oscillators whose dynamics execute an efficient search for solutions of combinatorial optimization problems73. Volatile memristive devices based on Mott insulator–metal transition materials, such as VO2 (ref. 74) and NbO2 (ref. 75), as well as spintronic oscillators based on MTJs (ref. 76) can be used to realize compact nanoscale oscillators that facilitate this form of computing.

Several memory-centric problems in machine learning could also benefit from in-memory computing. One is sparse dictionary learning, a learning framework in which a sparse representation of input data is obtained in the form of a linear combination of basic elements, which form the so-called dictionary of features. As opposed to the transform coding approach described earlier, both the dictionary and the sparse representation are learned from the input data. If the learned dictionary is mapped onto device conductance values in a crossbar array, it is possible to obtain the sparse representation using the iterative-shrinking threshold77 or locally competitive algorithms78. The matrix-vector and the transpose-matrix-vector multiplications associated with the algorithms are performed in the crossbar. Dictionary learning requires updating the conductance values by exploiting the accumulative behaviour of the memristive devices, based on, for example, stochastic gradient descent77,79, which is challenging due to device stochasticity and nonlinear conductance change with the number of applied pulses79. Another application is principal component analysis, a dimensionality reduction technique to reveal the internal structure of data by using a limited number of principal components. It is usually achieved by finding the eigenvectors of the data covariance matrix. This can be realized using the 'power iteration' method in which the MVM operations can be performed using in-memory computing65. An alternative approach is to use a linear feedforward neural network in which the weights are implemented in a crossbar array. The network is optimized via unsupervised learning using Sanger's rule to obtain the principal components, given by the weights connected to each output neuron representing the classes in which the data is clustered80.

Another relevant application for in-memory computing, which is used in several machine learning algorithms, is associative memory. An associative memory compares input search data with the data stored in it and finds the address of the data with the closest match to the input data81. This capability is used in several learning frameworks, such as brain-inspired hyperdimensional computing82,83 and memory-augmented neural networks84,85. One way to realize associative memory is to use a Hopfield network, which can be trained to minimize the energy of the states that it should remember. This has been successfully demonstrated on small arrays of PCM86 and RRAM87 devices. Another more straightforward way to realize associative memory is simply to encode the stored data directly in a crossbar array and compute, in parallel, the Hamming distances of each stored data vector with the input search data vector via in-memory dot-products88.

Finally, the collective dynamics of an ensemble of dynamical systems could be exploited to perform certain machine learning tasks. One prominent example of this is reservoir computing (RC). The essential idea of reservoir computing is to map inputs into a high-dimensional space such that it is possible to classify the input patterns with a simple linear classifier. One of the approaches to implement RC is to feed the input into a fixed physically realized dynamical system. Memristive devices could play a key role in these types of physical RC. For example, Du et al. proposed the use of a collection of memristive devices with short-term temporal dynamics to serve as the physical reservoir and to classify temporal signals89. Sebastian et al. used a reservoir of a million PCM devices and exploited their accumulative behaviour to classify binary random processes into correlated and uncorrelated classes90.

Deep learning. Recently, deep artificial neural networks, loosely inspired by biological neural networks, have shown a remarkable human-like performance in tasks such as image processing and voice recognition91. A deep neural network (DNN) consists of at least two layers of nonlinear neuron units interconnected by adjustable synaptic weights. Modern DNNs can have over 1000 layers92. By tuning the adjustable weights, for instance, optimizing them by using millions of labelled examples, these networks can solve certain problems remarkably well. Dedicated mixed-signal chips that could implement multi-layer networks were already developed in the early 1990s but were eventually abandoned in favour of field-programmable gate arrays (FPGAs) and general-purpose graphics processing units (GPGPUs), partly due to lack of flexibility93. While high-performance GPGPUs are incontestably the hardware that has been primarily responsible for the recent success of deep learning, mixed-signal architectures based on in-memory computing are being actively researched, targeting mostly edge computing applications where high energy efficiency is critical.

A DNN can be mapped onto multiple crossbar arrays of memory devices that communicate with each other as illustrated in Fig. 6a. A layer of the DNN can be implemented on (at least) one crossbar, in which the weights Wij of that layer are stored in the charge or conductance state of the memory devices at the crosspoints. The propagation of data through that layer is performed in a single step by inputting the data to the crossbar rows and deciphering the results at the columns. The results are then passed through the neuron nonlinear function and input to the next layer. The neuron nonlinear function is typically implemented at the crossbar periphery, using analogue or digital circuits. Because every layer of the network is stored physically on different arrays, each array needs to communicate at least with the array(s) storing the next layer for feed-forward networks, such as multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs). For recurrent neural networks (RNNs), the output of an array needs to communicate with its input. Array-to-array communication can be realized using a flexible on-chip network, akin to those used in digital DNN accelerators94. However, their efficient adaptation to in-memory computing based architectures is still being explored95.

The efficient MVM realized via in-memory computing is very attractive for inference-only applications, where data is propagated through the network on offline-trained weights. With respect to specialized inference accelerators operating at reduced digital precision (4 to 8-bit), such as Google's tensor processing unit4 and low-power GPGPUs such as NVIDIA T496, in-memory computing aims to improve the energy efficiency even further by eliminating the separation between memory and processing for the MVM operations. Implementations using SRAM-based in-memory computing have focused on binary weight networks, in which weights are represented by a single bit97. Various implementations, such as current-based21 and charge-based22,23 computational circuits, have been proposed and were able to demonstrate 1-bit arithmetic energy efficiencies of >100 tera operations per second per watt (TOPS W−1) for MVM. Chips using in-memory computing on non-volatile memory devices have also been fabricated using NOR-Flash28 and RRAM98–100.
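Among the primitives discussed above, the Hamming-distance associative memory is particularly easy to express as a single in-memory dot product. The sketch below (binary words, the bipolar read encoding and the array sizes are illustrative assumptions) stores one word per crossbar row and retrieves the closest match to a corrupted query with one MVM:

```python
import numpy as np

rng = np.random.default_rng(5)
n_bits, n_entries = 64, 10

# Stored data, one binary word per crossbar row
memory = rng.integers(0, 2, size=(n_entries, n_bits))

def nearest_entry(query: np.ndarray) -> int:
    """Content-addressable lookup via a single in-memory MVM.

    With the bipolar encoding s = 2b - 1, the dot product of two words equals
    n_bits - 2 * HammingDistance, so the row with the largest dot product is
    the row with the smallest Hamming distance to the query.
    """
    scores = (2 * memory - 1) @ (2 * query - 1)   # one crossbar MVM
    return int(np.argmax(scores))

# Query: a stored word corrupted by a few bit flips
target = 3
query = memory[target].copy()
flip = rng.choice(n_bits, size=5, replace=False)
query[flip] ^= 1
print("closest match:", nearest_entry(query), "(expected", target, ")")
```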

[Figure 6 panels (schematics): a, a DNN mapped onto multiple crossbar arrays of a computational memory unit connected by a communication network; b and c, the forward (Σi xiWij), backward (Σj δjWij) and weight-update (∆Wij) steps of training.]
Fig. 6 | Deep learning training and inference using in-memory computing. a, Implementation of a feed-forward DNN on multiple crossbar arrays of
memory devices. The synaptic weights Wij are stored in the conductance or charge state of the memory devices. Each layer of the network is implemented
in a different crossbar. Forward propagation of data through the network is performed by applying, for each layer, input data on the crossbar rows, and
deciphering the results at the column level. The results are then passed through a nonlinear function implemented at the periphery and input to the next
layer. A global communication network is used to send data from one array to another. b, A first possible implementation of the three steps performed
in training a layer of a neural network in a crossbar array. Forward and backward propagations are implemented by inputting activations xi and errors δj
on the rows and columns, respectively. An in-place weight update can be performed by sending pulses based on the values of xi and δj from the rows and
columns simultaneously. This implements an approximate outer product and programs the devices at the same time. c, A second possible implementation,
whereby the weight update ∆Wij is computed in the digital domain and applied via programming pulses to the corresponding devices.
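A compact numerical rendition of the training loop in Fig. 6b is given below (a single-layer toy example; the tanh neuron, the learning rate and the finite programming step standing in for device granularity are assumptions). The update is the outer product of the input activations and the layer's error, applied in place with quantized increments:

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_out = 8, 4
DW_STEP = 1e-3   # assumed minimum conductance change per programming event

def train_step(W, x, target, lr=0.1):
    """One forward/backward/update cycle for a single crossbar layer (Fig. 6b)."""
    y = np.tanh(W @ x)                        # forward: MVM + neuron nonlinearity
    delta = (y - target) * (1.0 - y ** 2)     # backward: error gradient at this layer
    dW = -lr * np.outer(delta, x)             # outer-product weight update
    W = W + DW_STEP * np.round(dW / DW_STEP)  # granular, in-place device programming
    return W, float(np.mean((y - target) ** 2))

W = 0.1 * rng.standard_normal((n_out, n_in))  # weights held as device conductances
x = rng.standard_normal(n_in)
target = np.array([0.5, -0.2, 0.1, 0.0])
for _ in range(200):
    W, loss = train_step(W, x, target)
print(f"loss after training: {loss:.2e}")
```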

Using non-volatile memory ensures that the weights will be retained when the power supply is turned off, unlike with SRAM. Also, the multi-level storage capability of these devices can be exploited to implement non-binary networks, which yield higher accuracy and are easier to train than binary weight networks. Usually, at least two devices per weight are used in a differential configuration to implement positive and negative weights101. Multiple binary/multi-level devices using the bit-slicing technique can be used to further increase the precision58,59,98,102. The state-of-the-art experimental demonstrations of DNN inference based on in-memory computing have reported a competitive energy efficiency of ≥10 TOPS W−1 for reduced-precision MVM (Table 1). Nonetheless, for all these implementations, custom training103–105 and/or on-chip retraining25,100 of the network is needed to mitigate the effect of defects, and device and circuit level non-ideality on the network accuracy. The training procedure should be generic and as agnostic as possible to the hardware such that the network would have to be trained only once to be deployed on a multitude of different chips. Another important research topic is the design of efficient intra- and inter-layer pipelines58 to ensure that all the arrays on the chip are always active during inference, together with flexible array-to-array communication and control. It is especially important for CNNs, in which a large image is passed through small kernels at only a few pixels at a time106, leading to prohibitive latencies and buffer requirements if no pipelining is used.

In-memory computing can also be used in the context of supervised training of DNNs with backpropagation. This training involves three stages: forward propagation of labelled data through the network, backward propagation of the error gradients from output to the input of the network, and weight update based on the computed gradients with respect to the weights of each layer. This procedure is repeated over a large dataset of labelled examples for multiple epochs until satisfactory performance is reached by the network. This makes the training of state-of-the-art networks very time and energy-consuming even with high-performance GPGPUs.
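One way to read the 'multiple binary/multi-level devices per weight' idea above is as a residual decomposition: each additional device stores, at lower significance, what the previous devices could not represent. The sketch below (the eight-level device, the geometric significance scaling and the value range are assumptions; fabricated chips typically use fixed binary significances instead) shows how the representation error shrinks as devices are added:

```python
import numpy as np

rng = np.random.default_rng(8)
N_LEVELS = 8   # assumed number of programmable conductance levels per device

def quantize(w, levels=N_LEVELS):
    """Program a value in [-1, 1] onto a single device with 'levels' states."""
    return np.round((w + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

def multi_device_weight(w, n_devices):
    """Compose one weight from several limited-precision devices.

    Each additional device stores the residual left by the previous ones,
    scaled up and quantized, so the significance of the devices decreases
    geometrically (one illustrative weight-slicing scheme)."""
    approx, residual, significance = 0.0, w, 1.0
    for _ in range(n_devices):
        g = quantize(np.clip(residual / significance, -1, 1))
        approx += significance * g
        residual = w - approx
        significance /= (N_LEVELS - 1)
    return approx

w = rng.uniform(-1, 1, size=1000)
for n in (1, 2, 3):
    err = np.max(np.abs([multi_device_weight(v, n) - v for v in w]))
    print(f"{n} device(s) per weight -> max representation error {err:.2e}")
```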


Table 1 | State-of-the-art chip-level experimental demonstrations of neural network inference based on in-memory computing

Device                      | SRAM          | SRAM           | SRAM           | NOR-Flash         | RRAM           | RRAM
CMOS technology             | 65 nm         | 65 nm          | 65 nm          | 180 nm            | 130 nm         | 55 nm
Array size                  | 16 kb         | 16 kb          | 2.4 Mb         | 100 kb            | 16 kb          | 1 Mb
Weight/activation precision | 1 bit/6 bit   | 1 bit/ternary  | 1 bit/1 bit    | Analogue/analogue | Analogue/8 bit | 3 bit/2 bit
Network                     | LeNet-5 CNN   | MLP/CNN        | 5/9-layer CNN  | 2-layer MLP       | 5-layer CNN    | CNN
Dataset                     | MNIST         | MNIST/CIFAR-10 | MNIST/CIFAR-10 | MNIST             | MNIST          | CIFAR-10
Accuracy                    | 98.3%         | 98.3%/85.7%    | 98.6%/83.3%    | 94.7%             | 96.2%          | 88.52%
Peak MAC efficiency¹        | 40.3 TOPS W−1 | 139 TOPS W−1   | 658 TOPS W−1   | 10 TOPS W−1       | 11 TOPS W−1    | 21.9 TOPS W−1
Reference                   | 22            | 21             | 23             | 28                | 100            | 98

¹One multiply-and-accumulate (MAC) = 2 operations (OPs). MNIST, Modified National Institute of Standards and Technology database; CIFAR, Canadian Institute for Advanced Research.

When performing training of a neural network encoded in crossbar arrays, forward propagation is performed in the same way as for the inference described above. The only difference is that all the activations xi of each layer have to be stored locally in the periphery. Next, backward propagation is performed by inputting the error gradient δj from the subsequent layer onto the columns of the current layer and deciphering the result from the rows. The resulting weighted sum Σj δjWij needs to be multiplied by the derivative of the neuron nonlinear function, which is computed externally, to obtain the error gradient of the current layer. Finally, the weight update is performed based on the outer product of activations and error gradients xiδj of each layer. One approach is to perform a parallel weight update by sending deterministic or stochastic overlapping pulses from the rows and columns simultaneously to implement an approximate outer product and program the devices at the same time (Fig. 6b)107–111. While this parallelism may be efficient in terms of speed, each outer product needs to be applied to the arrays one at a time (either after every training example or one by one after a batch of examples), leading to a large number of pulses applied to the devices. This results in stringent requirements on the device granularity, asymmetry and linearity to obtain accurate training109,112, and high device endurance is critical. Using multiple devices per synapse with a periodic carry can relax some of the device requirements, at the price of a costly reprogramming of the entire array every time the carry is performed110,111.
Another approach is a mixed analogue/digital weight update whereby ∆Wij is computed digitally and applied to the arrays row-by-row or column-by-column (Fig. 6c). ∆Wij can be applied either at every individual training example (online training) or batch of training examples (by accumulating all the updates within one batch in a digital memory)113–115. ∆Wij can also be accumulated in a digital memory across batches and specific devices are programmed when their corresponding accumulated values reach a threshold116. This approach is more flexible than the parallel weight update based on overlapping pulses because it can implement any learning rule, not only stochastic gradient descent, and the digital computation and accumulation of weight updates significantly relax the requirements on the device granularity and endurance116. However, the cost is the need for additional digital computing and memory hardware. The training approaches presented here are still at the stage of functionality demonstration and need to overcome the device-related challenges before they could be employed on edge devices in applications where online learning is desirable.
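To make the weight-update discussion concrete, the following minimal NumPy sketch models one training step of a crossbar-encoded layer under a mixed analogue/digital scheme. The two-conductance weight encoding, the asymmetric step sizes and the array dimensions are illustrative assumptions for this sketch, not parameters of any reported hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative device model (not tied to any specific chip): each weight W_ij is
# stored as the difference of two conductances, and every programming pulse
# changes a conductance by one finite, slightly asymmetric step (granularity).
G_STEP_POS, G_STEP_NEG = 1.0, 0.8      # potentiation/depression step (a.u.)
W_PER_STEP = 1e-3                      # weight change produced by one pulse
Gp = rng.uniform(0.0, 50.0, (128, 256))
Gn = rng.uniform(0.0, 50.0, (128, 256))

def weights():
    return (Gp - Gn) * W_PER_STEP

def mixed_analogue_digital_update(x, delta, lr=0.1):
    """The outer product delta*x^T is computed digitally, rounded to an integer
    number of programming pulses, and the pulses are applied to the
    conductance pair representing each weight."""
    dW = lr * np.outer(delta, x)
    pulses = np.rint(dW / W_PER_STEP).astype(int)
    Gp[pulses > 0] += G_STEP_POS * pulses[pulses > 0]
    Gn[pulses < 0] += G_STEP_NEG * (-pulses[pulses < 0])

x = rng.standard_normal(256)           # layer input (stored in the periphery)
y = weights() @ x                      # forward pass: analogue MVM on the array
delta = rng.standard_normal(128)       # error gradient from the next layer
mixed_analogue_digital_update(x, delta)
```

Because the update is computed digitally before being translated into pulses, any learning rule can be substituted for the outer product in this sketch, which mirrors the flexibility argument made above.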
A third application domain for in-memory computing in deep learning is spiking neural networks (SNNs). The main difference between SNNs and the non-spiking neural networks discussed so far is that SNN neurons compute with asynchronous spikes that are temporally precise, as opposed to continuous-valued activations that operate on a common clock cycle. Hence, SNNs are ideally suited for processing spatio-temporal event-driven information from neuromorphic sensors. There has been significant progress in recent years in designing deep SNNs trained with supervised learning that can perform close to conventional DNNs117. The main approaches rely either on converting weights from a previously trained non-spiking DNN118,119, or implementing backpropagation training using spike signals on the SNN itself120,121. Recently it has been shown that a spiking neuron can be transformed into a recurrent neural network unit, and thus it is possible to apply the existing deep learning frameworks for seamless training of any SNN architecture with backpropagation through time122. However, most of the efforts in applying in-memory computing to SNNs have focused on unsupervised learning with local learning rules. The best-known example for this is spike-timing-dependent plasticity (STDP), which adjusts a synaptic weight based on the relative timing between its output and input neuron spikes. In-memory implementations of SNNs have traditionally been done using slow subthreshold analogue CMOS circuits that directly emulate the functions of neurons and synapses, together with fast event-driven digital communication12,123. Support for STDP learning was also successfully implemented124. Non-volatile nanoscale devices, such as PCM125–128 and RRAM129,130, have been proposed to be integrated as part of the synapse and neuron circuits in a hardware SNN. Support for STDP learning with these devices has been generally implemented using rather complex schemes based on overlapping pulses. However, STDP-based learning rules have still not been able to reach the accuracy of conventional DNNs trained with backpropagation, despite significant recent progress131. Although SNNs are believed to be computationally more powerful than conventional DNNs because of the added temporal dimension, an application where this advantage is clearly demonstrated and exploited is still lacking. This is one of the reasons why generally SNNs have not been as widely adopted as conventional DNNs. However, with the incorporation of additional bio-inspired neuronal and synaptic dynamics132, SNNs could transcend conventional deep learning in certain application domains and memristive devices could be exploited to natively implement such dynamics133.
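For readers unfamiliar with STDP, the sketch below shows a generic pair-based form of the rule referred to above: a synapse is strengthened when the presynaptic spike precedes the postsynaptic one and weakened otherwise, with exponentially decaying sensitivity to the spike-time difference. The amplitudes and time constant are illustrative values, not taken from any of the cited implementations.

```python
import numpy as np

# Minimal pair-based STDP rule (illustrative constants).
A_PLUS, A_MINUS = 0.01, 0.012     # potentiation/depression amplitudes
TAU = 20.0                        # STDP time constant (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for a single pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt >= 0:                            # pre before post -> potentiation
        return A_PLUS * np.exp(-dt / TAU)
    return -A_MINUS * np.exp(dt / TAU)     # post before pre -> depression

# Example: the same |dt| gives opposite-signed, slightly asymmetric updates.
print(stdp_dw(t_pre=10.0, t_post=15.0))    # approximately +0.0078
print(stdp_dw(t_pre=15.0, t_post=10.0))    # approximately -0.0093
```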

devices and is attributed to the threshold switching dynamics as research is needed to efficiently orchestrate the allocation of the
well as the variability associated with the HRS states136,137. In both partial vector components across different arrays and maximize the
RRAM and PCM, the dependence of the delay time on the write areal efficiency in coding sparse matrices on crossbars. The inherent
voltage provides us a means to tune its distribution. PCM exhib- parallelism offered by analogue computations can also potentially
its additional stochasticity associated with crystallization time. It is reduce the computational complexity of a problem. For instance,
attributed to the small variations in the atomic configurations of the NP-hard problems involving combinatorial optimization can ben-
amorphous volume created upon the preceding RESET. This results efit from analogue acceleration of MVMs or using networks of cha-
in variability associated with the number of pulses that are needed otic and nonlinear memristive elements to accelerate the solution
to fully crystallize the amorphous volume137. search. For applications in stochastic computing, in which memris-
Random number generation is important for a variety of areas, tive devices are not employed to reduce data accesses, the overall
such as stochastic computing, data encryption, machine learning benefits can be expected only from the memristive TRNG accelera-
and deep learning138,139. Therefore, there is a significant interest in tion over a conventional implementation. For the logic primitives,
employing memristive devices as an entropy source for a compact performance benefits come from avoiding moving data to a proces-
and efficient true random number generator (TRNG). As opposed sor to perform the logic operations. However, efficiently cascading
to a pseudo-random number generator (PRNG), a TRNG does the logic primitives to perform more complex logic operations, such
not require a seed and uses the entropy arising from physical phe- as a full adder47,154,155 or fixed-point multiplier156, is critical in achiev-
nomena such as Johnson-Nyquist noise, time-dependent dielectric ing end-to-end benefits in applications. Candidate applications in
breakdown or ring oscillator jitter140. The stochastically switching which in-memory logic could be leveraged include database query
memristive device in conjunction with a simple circuitry, compris- and encryption of data157, object detection and evaluation of fast
ing a comparator and some digital logic, can be used to realize a Fourier transforms50 and image processing kernels156.
TRNG (Fig. 7a)141. Several variants of this idea have been explored Computing with charge-based computing devices is attractive
using RRAM142,143, PCM137 and STT-MRAM144,145. due to their technological maturity, even though SRAM has a rela-
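A toy model of the Fig. 7a-style TRNG is sketched below: each write pulse triggers a SET transition after a stochastic delay, the remaining pulse time is counted in clock periods, and the parity of the count yields one bit. The choice of a lognormal delay distribution and all numerical parameters are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

CLOCK_PERIOD_NS = 1.0
PULSE_WIDTH_NS = 500.0

def trng_bits(n_bits, median_delay_ns=100.0, sigma=0.5):
    """Generate bits from the parity of the (pulse width - switching delay)
    measured in clock periods, as in the scheme of Fig. 7a."""
    delays = rng.lognormal(mean=np.log(median_delay_ns), sigma=sigma, size=n_bits)
    delays = np.minimum(delays, PULSE_WIDTH_NS)   # device must switch within the pulse
    counts = np.floor((PULSE_WIDTH_NS - delays) / CLOCK_PERIOD_NS).astype(int)
    return counts & 1                             # even/odd multiple -> 0/1

bits = trng_bits(100_000)
print(bits[:16], bits.mean())   # mean should be close to 0.5 for a useful TRNG
```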
The stochastic number streams generated by memristive TRNG blocks have also been employed to realize efficient multiply units142. For example, the multiplication of two numbers between 0 and 1 can be efficiently realized by performing an AND operation between binary random bit streams representing those numbers138. Another interesting application is that of performing probabilistic inference using Bayes's rule (Fig. 7b). For example, the required probability distributions can be generated as random bit streams using a stochastically switching MRAM device146. The stochasticity associated with memristive devices has also found applications in spiking neural networks where stochastically firing neurons147,148 (Fig. 7c) and stochastic binary synapses149 have been proposed.
Another promising application is in the domain of security. A physically unclonable function (PUF) is a physical system that statistically maps an input digital word to an output one through a secret key depending on an intrinsically stochastic property of the chip. Typically, silicon process variations or the inherent physical variability of device parameters are exploited. A PUF can be viewed as a computational unit that returns an output response, r = f(c), for each input challenge, c. Here f describes the unique internal physical characteristics of the PUF. A specific PUF instance is defined by a set of possible challenge-response pairs (CRPs). SRAM devices are commonly used to implement PUF circuitry by exploiting the metastable states of cross-coupled inverters150. However, memristive devices organized in a crossbar array can be exploited to design a much stronger PUF with a significantly larger CRP set (Fig. 7d). The key idea is to exploit the broad distribution of memristive resistance values as well as the exponential number of available current sneak paths151–153.
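A schematic model of such a crossbar PUF is sketched below: a fixed random conductance matrix stands in for the device-to-device variability frozen in at fabrication, a binary challenge drives the rows, and the thresholded column currents form the response. Sneak-path effects and the read circuitry are deliberately ignored, and all parameters are assumptions for illustration.

```python
import numpy as np

class CrossbarPUF:
    """Schematic N x N crossbar PUF: the secret is the random conductance
    variation of the array; a binary challenge selects rows and the
    thresholded column currents give an N-bit response."""
    def __init__(self, n=64, seed=7):
        rng = np.random.default_rng(seed)        # stands in for process randomness
        self.G = rng.lognormal(mean=0.0, sigma=0.4, size=(n, n))

    def response(self, challenge):
        currents = challenge @ self.G            # column currents for a unit read voltage
        return (currents > np.median(currents)).astype(int)

puf = CrossbarPUF()
c = np.random.default_rng(3).integers(0, 2, size=64)
print(puf.response(c))    # N-bit response; a different chip instance would differ
```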
Opportunities, challenges and perspective
There are different attributes in the applications discussed in the 'Applications' section that can be leveraged through in-memory computing in order to increase the overall system performance. To take advantage of in-memory computing for MVM, it is preferable for the application to perform many MVMs on large squarish and dense matrices that stay constant throughout its execution. In this way, only smaller vector data have to be moved in and out of the crossbar arrays. This effectively reduces the overall data movement by eliminating frequent accesses to the matrix data. Applications that fall into this category include deep learning inference, dense iterative linear solvers, compressed sensing, sparse coding and associative memories. Although there has been some work on leveraging sparse MVM through in-memory computing69 as well, more research is needed to efficiently orchestrate the allocation of the partial vector components across different arrays and maximize the areal efficiency in coding sparse matrices on crossbars. The inherent parallelism offered by analogue computations can also potentially reduce the computational complexity of a problem. For instance, NP-hard problems involving combinatorial optimization can benefit from analogue acceleration of MVMs or using networks of chaotic and nonlinear memristive elements to accelerate the solution search. For applications in stochastic computing, in which memristive devices are not employed to reduce data accesses, the overall benefits can be expected only from the memristive TRNG acceleration over a conventional implementation. For the logic primitives, performance benefits come from avoiding moving data to a processor to perform the logic operations. However, efficiently cascading the logic primitives to perform more complex logic operations, such as a full adder47,154,155 or fixed-point multiplier156, is critical in achieving end-to-end benefits in applications. Candidate applications in which in-memory logic could be leveraged include database query and encryption of data157, object detection and evaluation of fast Fourier transforms50 and image processing kernels156.
Computing with charge-based computing devices is attractive due to their technological maturity, even though SRAM has a relatively large areal footprint even at advanced technology nodes and DRAM and Flash memory face severe scaling challenges. Charge-based analogue computation is inherently subject to thermal noise, which sets an upper limit to the precision achievable for a given capacitor size and ambient temperature. Additionally, the manufacturing process introduces non-idealities in the form of capacitor size variations, thus limiting the maximum achievable accuracy. Memristive devices, on the other hand, could potentially be scaled to dimensions of a few nanometers158–161. The key challenges for memristive devices are write variability and conductance variations. Write variability captures the inaccuracies associated with writing an array of devices to desired conductance values. In RRAM, the physical origin of this variability lies mostly in the stochastic nature of filamentary switching and one prominent approach to counter this is that of establishing preferential paths for CF formation162,163. Representing single computational elements by using multiple memory devices could also mitigate variability issues164. Conductance variations refer to the temporal and temperature-induced variations of the programmed conductance values. One prominent example is 'drift' in PCM devices, which is attributed to the intrinsic structural relaxation of the amorphous phase. A promising approach towards addressing drift is that of projected phase change memory, which comprises a non-insulating material segment parallel to the phase change material segment165,166.
There are also several challenges to be tackled at the peripheral circuit level for in-memory computing. A critical issue is the need for digital-to-analogue (analogue-to-digital) conversion every time data goes in to (out of) the crossbar arrays. There are solutions that employ fully analogue peripheral circuits to avoid such conversions28,111, at the cost of less flexibility and accuracy. Usually, the preferred method for inputting digital data to memristive crossbars is pulse-width modulation, because the result of the computation based on Ohm's law will not be affected by the nonlinearity of the current–voltage characteristics of the devices. For digitizing the crossbar output, most works have employed analogue-to-digital converters (ADCs)21,22 or sense amplifiers98. The precision of the digitization needs to be sufficient to properly resolve the analogue multiply–accumulate operations, and a precision of at least four bits (including sign) has so far been necessary for DNN inference applications21,22,98. Because of their large area and power consumption, it is typically required to multiplex ADCs across multiple columns, which increases the latency.
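The combined effect of write variability and finite ADC resolution on an analogue MVM can be illustrated with the minimal sketch below. The multiplicative conductance-error model, the per-column range scaling and all numerical values are assumptions chosen only to make the trade-off visible, not a model of any specific device or circuit.

```python
import numpy as np

rng = np.random.default_rng(2)

def crossbar_mvm(W, x, g_sigma=0.03, adc_bits=4):
    """Illustrative crossbar MVM: weights suffer multiplicative programming
    (write) variability and the column outputs are digitized by a uniform
    signed ADC of the given resolution."""
    G = W * (1.0 + g_sigma * rng.standard_normal(W.shape))   # conductance errors
    y = G @ x                                                # analogue accumulate
    full_scale = np.abs(W).sum(axis=1) * np.abs(x).max()     # per-column range scaling
    levels = 2 ** (adc_bits - 1) - 1                         # signed ADC codes
    codes = np.clip(np.rint(y / full_scale * levels), -levels - 1, levels)
    return codes / levels * full_scale

W = rng.standard_normal((64, 64)) / 8.0
x = rng.standard_normal(64)
exact = W @ x
approx = crossbar_mvm(W, x)
print(np.max(np.abs(exact - approx)))   # error from variability plus quantization
```

Increasing adc_bits or reducing g_sigma in this sketch shrinks the error, which is the same trade-off that drives the four-bit minimum precision reported for DNN inference.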


[Figure 7, panels a–d (schematic illustrations of the memristive TRNG circuit, an example Bayesian network, the stochastic phase-change neuron and the memristive crossbar PUF); see caption below.]
Fig. 7 | Stochasticity associated with memristive devices and applications in computing. Resistance switching in memristive devices is intrinsically
stochastic, with an ability to control the stochasticity via the voltage and duration of write pulses. a, Schematic illustration of a circuitry that exploits
memristive stochasticity for the generation of true random numbers141. The device is connected in series with a resistor in a voltage divider configuration.
A write pulse of a certain fixed duration is applied to the device. A SET transition in the device after a stochastic delay time will cause the comparator to
output a 1. The difference between the pulse duration and the delay time is measured by a counter in units of a fixed clock period. Based on whether this
time is an even or odd multiple of the clock period, a 0 or 1 bit is assigned. By applying a sequence of write pulses, a stochastic bit stream is generated.
b, A Bayesian network is shown where each node represents random variables and each link describes the direct dependence among them, quantified
in terms of the transitional conditional probabilities. Such networks can be used to estimate the probability of hidden causes from a given observation.
The required probability distributions to perform such probabilistic inference can be generated efficiently using stochastically switching memristive
devices. For example, the probabilities can be encoded within Poisson distributed binary bit streams generated using MRAM devices146. The associated
computations such as the intersection operation can be implemented by multiplying the two bit streams with an AND gate. c, The stochasticity associated
with the SET process in PCM can be used to realize stochastically firing neurons. The key computational element is the neuronal membrane, which stores
the membrane potential in the phase configuration of a PCM device. These devices enable the emulation of large and dense populations of neurons
for bioinspired signal representation and computation. d, Memristive crossbar arrays can be used to generate physically unclonable functions (PUF).
The broad distribution of resistance values as well as the current sneak paths are exploited to obtain a large set of challenge-response pairs (CRP). For
example, in an N×N crossbar PUF depicted here, the challenge consists of an N-bit vector applied to the N rows. The current from the N columns is then
read and converted to an N-bit response. The theoretical number of CRPs is 2^N.
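A minimal sketch of the bit-stream arithmetic described above and in the caption of Fig. 7b is given below: probabilities are encoded as random bit streams (standing in for bits produced by a stochastically switching memory device) and their product is obtained with a bitwise AND. The stream length and the probability values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

def bernoulli_stream(p, n=100_000):
    """Encode a probability p as a random bit stream with mean p."""
    return (rng.random(n) < p).astype(np.uint8)

p_a, p_b = 0.2, 0.5
joint = bernoulli_stream(p_a) & bernoulli_stream(p_b)   # AND of streams = product
print(joint.mean())   # close to 0.2 * 0.5 = 0.1 for independent streams
```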

Moreover, it is critical to properly scale the input and output ranges, such that the crossbar output falls within the limited dynamic range of the ADC; otherwise there would be a prohibitive loss of computational precision. Another important challenge is the finite resistance of the crossbar wires. It can lead to parasitic voltage drops on the devices during readout when a high current is flowing through them (referred to as the IR drop), creating errors in the analogue computation results. This not only limits the maximum crossbar size that can be reliably operated, but also the integration density because of the difficulty to use the metal layers close to the CMOS front-end due to their higher resistivity.
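A crude first-order estimate of the IR-drop effect is sketched below: each cell along a row adds one segment of wire resistance, and the accumulated current of all downstream devices flowing through that segment reduces the voltage actually seen far from the driver. The wire resistance per segment, device conductance and read voltage are illustrative assumptions, not values from a specific technology.

```python
import numpy as np

N = 256                    # number of cells along the row
R_WIRE = 1.0               # wire resistance of one cell-to-cell segment (ohm)
G_DEV = 5e-6               # device conductance, here 5 uS for every cell (S)
V_READ = 0.2               # voltage applied at the row driver (V)

# First-order approximation: every device nominally draws G_DEV * V_READ, and
# the current through wire segment j is the sum of all downstream device
# currents, so nodes far from the driver see a reduced voltage.
i_cell = G_DEV * V_READ
downstream = np.arange(N, 0, -1)              # devices fed through each segment
segment_drop = R_WIRE * i_cell * downstream   # voltage drop across each segment
v_node = V_READ - np.cumsum(segment_drop)     # voltage at each device node

print(f"far-end read voltage: {v_node[-1]:.3f} V "
      f"({100 * (1 - v_node[-1] / V_READ):.1f}% droop)")   # about 16% for these values
```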

From an architectural point of view, a computational memory unit could have multiple in-memory computing cores connected through an on-chip network95. Besides the memory arrays and associated peripheral circuitry, each in-memory compute core could also have some rudimentary digital processing units as well as conventional memory such as registers and buffers. There is significant on-going research on defining such hierarchical organizations of in-memory computing cores to tackle a range of applications58,167,168. Another crucial aspect is the design of a software stack that extends from the user-level application to the low-level driver that directly controls the computational memory unit. The software stack is responsible for transparently compiling, for example a machine learning model, into optimized operations and routing, and orchestrating data movement to and from the unit. Recent works have started to explore some of these aspects for specific DNN inference workloads168,169.
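One of the basic jobs of such a software stack is partitioning weight matrices that exceed the size of a single array onto multiple crossbar tiles and recombining the partial results in the digital periphery. The sketch below illustrates this mapping for an arbitrary tile size; the tile dimension and matrix shapes are assumptions made for the example only.

```python
import numpy as np

TILE = 256   # assumed crossbar dimension (rows x columns per tile)

def tile_matrix(W, tile=TILE):
    """Split a large weight matrix into crossbar-sized tiles, recording where
    each tile sits so partial MVM results can be recombined."""
    tiles = []
    for r in range(0, W.shape[0], tile):
        for c in range(0, W.shape[1], tile):
            tiles.append((r, c, W[r:r + tile, c:c + tile]))
    return tiles

def tiled_mvm(tiles, x, out_dim):
    """Each tile performs its local analogue MVM; the digital periphery adds
    the partial sums belonging to the same output rows."""
    y = np.zeros(out_dim)
    for r, c, block in tiles:
        y[r:r + block.shape[0]] += block @ x[c:c + block.shape[1]]
    return y

rng = np.random.default_rng(5)
W = rng.standard_normal((600, 900))
x = rng.standard_normal(900)
tiles = tile_matrix(W)
print(np.allclose(tiled_mvm(tiles, x, 600), W @ x))   # True
```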
The specific requirements that the devices need to fulfill when employed for computational memory are likely to be different from those needed for conventional memory and will also be highly application dependent. One requirement for memristive devices, which is common to most computing applications, is that the low-resistance state should be resistive enough to limit the impact of the IR drop during writing and readout of the array. For memristive stateful logic, the requirements include an abrupt, fast and low-power threshold switching characteristic170, high cycling endurance (>10^12 cycles) as well as low device-to-device variability of switching voltages and LRS/HRS values. For computational tasks involving read-only operations, such as MVM, endurance is much less critical as long as the conductance states remain unchanged during their execution. However, a gradual analogue-type switching characteristic is desirable for programming a continuum of conductance values in a single device, and temporal conductance variations, device failures and variability can severely affect the performance171. Gradual, linear and symmetric conductance changes are also desired in applications where the device conductance needs to be incrementally modified such as neural network training112. For stochastic computing applications, random device variability is not an issue, but graceful device degradation is137. Moreover, very fast and low-power switching devices with high endurance are necessary for being competitive with efficient CMOS-based implementations140.
Besides the conventional memory devices presented in this Review, several new memory concepts are being proposed for in-memory computing172–174. Even though promising, it is difficult to fully assess their benefits in the absence of large-scale experimental demonstrations and/or integration with CMOS technology. Ferroelectric devices, such as ferroelectric random access memory175, ferroelectric field effect transistors176 and ferroelectric tunnel junctions177, have also been explored for in-memory computing and the newly discovered ferroelectricity in hafnium oxide has given significant impetus to this research. There is also a recent interest in photonic memory devices178,179, where data can be written, erased and read optically. Such devices are being explored for all-photonic chip-scale information processing. For example, by integrating phase-change materials onto an integrated photonics chip, the analogue multiplication of an incoming optical signal by a scalar value, encoded in the state of the phase change material, was performed180. One of the primary advantages of the optical approach is the potential for inherent wavelength division multiplexing.
The explosive growth of AI, in particular deep neural networks, has created a market for high performance and efficient inference and training chips, both in the cloud and on the edge. Moreover, mobile devices, which are particularly hampered by energy constraints, are playing an increasingly important role in defining the future of computing. Yet another reason is that the cost per transistor is plateauing even though transistor sizes continue to get smaller (albeit not at the rate envisaged by Gordon Moore anymore). This could prompt many chip manufacturers to sustain older technology nodes but instead equip the chips with high performance computing engines such as computational memory. Note that most of the memristive device technologies are amenable to back end of line integration, thus enabling their integration with a wide range of front end CMOS technologies. To conclude, in-memory computing, using both charge-based as well as resistance-based memory devices, is poised to have a significant impact on improving the energy/area efficiency as well as the latency compared to conventional computing systems, and, given the conducive market environment, this could usher in a new era of non-von Neumann computing.

Received: 16 September 2019; Accepted: 10 February 2020; Published online: 30 March 2020

References
1. Mutlu, O., Ghose, S., Gómez-Luna, J. & Ausavarungnirun, R. Processing data where it makes sense: Enabling in-memory computation. Microprocess. Microsyst. 67, 28–41 (2019).
2. Horowitz, M. Computing's energy problem (and what we can do about it). In Proc. International Solid-State Circuits Conference (ISSCC) 10–14 (IEEE, 2014).
3. Keckler, S. W., Dally, W. J., Khailany, B., Garland, M. & Glasco, D. GPUs and the future of parallel computing. IEEE Micro 31, 7–17 (2011).
4. Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In Proc. International Symposium on Computer Architecture (ISCA) 1–12 (IEEE, 2017).
5. Sze, V., Chen, Y.-H., Yang, T.-J. & Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 2295–2329 (2017).
6. Patterson, D. et al. A case for intelligent RAM. IEEE Micro 17, 34–44 (1997).
7. Farooq, M. et al. 3D copper TSV integration, testing and reliability. In Proc. International Electron Devices Meeting 7–1 (IEEE, 2011).
8. Pawlowski, J. T. Hybrid memory cube (HMC). In Proc. Hot Chips Symposium (HCS) 1–24 (IEEE, 2011).
9. Kim, J. & Kim, Y. HBM: Memory solution for bandwidth-hungry processors. In Proc. Hot Chips Symposium (HCS) 1–24 (IEEE, 2014).
10. Shulaker, M. M. et al. Three-dimensional integration of nanotechnologies for computing and data storage on a single chip. Nature 547, 74 (2017).
11. Di Ventra, M. & Pershin, Y. V. The parallel approach. Nat. Phys. 9, 200 (2013).
12. Indiveri, G. & Liu, S.-C. Memory and information processing in neuromorphic systems. Proc. IEEE 103, 1379–1397 (2015).
13. Zhirnov, V. V. & Marinella, M. J. in Emerging Nanoelectronic Devices (ed. Chen, A.) Ch. 3 (Wiley Online Library, 2015).
14. Wong, H.-S. P. & Salahuddin, S. Memory leads the way to better computing. Nat. Nanotechnol. 10, 191 (2015).
15. Chua, L. Resistance switching memories are memristors. Appl. Phys. A Mater. Sci. Process. 102, 765–783 (2011).
16. Li, S. et al. DRISA: A DRAM-based reconfigurable in-situ accelerator. In Proc. International Symposium on Microarchitecture (MICRO) 288–301 (IEEE, 2017).
17. Seshadri, V. et al. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology. In Proc. International Symposium on Microarchitecture 273–287 (IEEE, 2017).
18. Jeloka, S., Akesh, N. B., Sylvester, D. & Blaauw, D. A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory. IEEE J. Solid-State Circuits 51, 1009–1021 (2016).
19. Aga, S. et al. Compute caches. In Proc. International Symposium on High Performance Computer Architecture (HPCA) 481–492 (IEEE, 2017).
20. Wang, J. et al. A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration. In Proc. International Solid-State Circuits Conference (ISSCC) 224–226 (IEEE, 2019).
21. Jiang, Z., Yin, S., Seok, M. & Seo, J. XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks. In Proc. Symposium on VLSI Technology 173–174 (IEEE, 2018).
22. Biswas, A. & Chandrakasan, A. P. CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks. IEEE J. Solid-State Circuits 54, 217–230 (2019).
23. Valavi, H., Ramadge, P. J., Nestler, E. & Verma, N. A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute. IEEE J. Solid-State Circuits 54, 1789–1799 (2019).
24. Verma, N. et al. In-memory computing: Advances and prospects. IEEE J. Solid-State Circuits 11, 43–55 (2019).
25. Gonugondla, S. K., Kang, M. & Shanbhag, N. R. A variation-tolerant in-memory machine learning classifier via on-chip training. IEEE J. Solid-State Circuits 53, 3163–3173 (2018).

26. Bankman, D., Yang, L., Moons, B., Verhelst, M. & Murmann, B. An 59. Chi, P. et al. PRIME: A novel processing-in-memory architecture for neural
always-on 3.8 μ J/86% CIFAR-10 mixed-signal binary CNN processor with network computation in ReRAM-based main memory. In Proc. 43rd
all memory on chip in 28-nm CMOS. IEEE J. Solid-State Circuits 54, Annual International Symposium on Computer Architecture (ISCA) News
158–172 (2019). 27–39 (IEEE, 2016).
27. Diorio, C., Hasler, P., Minch, A. & Mead, C. A. A single-transistor silicon 60. Song, L., Qian, X., Li, H. & Chen, Y. PIPELAYER: A pipelined ReRAM-
synapse. IEEE Transactions on Electron Devices 43, 1972–1980 (1996). based accelerator for deep learning. In Proc. The International Symposium
28. Merrikh-Bayat, F. et al. High-performance mixed-signal neurocomputing on High Performance Computer Architecture (HPCA), 541–552 (IEEE, 2017).
with nanoscale floating-gate memory cell arrays. IEEE Trans. Neural Netw. 61. Zidan, M. A. et al. A general memristor-based partial differential equation
Learn. Syst. 29, 4782–4790 (2018). solver. Nat. Electron. 1, 411 (2018).
29. Wang, P. et al. Three-dimensional NAND flash for vector-matrix multiplica­ 62. Higham, N. J. Accuracy and Stability of Numerical Algorithms, Vol. 80
tion. EEE Trans. Very Large Scale Integr. VLSI Syst. 27, 988–991 (2019). (Society for Industrial and Applied Mathematics, 2002).
30. Burr, G. W. et al. Access devices for 3D crosspoint memory. J. Vac. Sci. 63. Bekas, C., Curioni, A. & Fedulova, I. Low cost high performance
Technol. B Nanotechnol. Microelectron. 32, 040802 (2014). uncertainty quantification. In Proc. 2nd Workshop on High Performance
31. Hickmott, T. Low-frequency negative resistance in thin anodic oxide films. Computational Finance 1–8 (ACM, 2009).
J. Appl. Phys. 33, 2669–2682 (1962). 64. Le Gallo, M. et al. Mixed-precision in-memory computing. Nat. Electron. 1,
32. Beck, A., Bednorz, J., Gerber, C., Rossel, C. & Widmer, D. Reproducible 246–253 (2018).
switching effect in thin oxide films for memory applications. Applied Physics 65. Liu, S., Wang, Y., Fardad, M. & Varshney, P. K. A memristor-based
Letters 77, 139–141 (2000). optimization framework for artificial intelligence applications. IEEE Circuits
33. Waser, R. & Aono, M. Nanoionics-based resistive switching memories. and Systems Magazine 18, 29–44 (2018).
Nat. Mater. 6, 833–840 (2007). 66. Sun, Z. et al. Solving matrix equations in one step with cross-point resistive
34. Strukov, D. B., Snider, G. S., Stewart, D. R. & Williams, R. S. The missing arrays. Proc. Natl. Acad. Sci. USA 116, 4123–4128 (2019).
memristor found. Nature 453, 80 (2008). 67. Sturges, R. H. Analog matrix inversion (robot kinematics). IEEE Journal on
35. Ovshinsky, S. R. Reversible electrical switching phenomena in disordered Robotics and Automation 4, 157–162 (1988).
structures. Phys. Rev. Lett. 21, 1450 (1968). 68. Feinberg, B., Wang, S. & Ipek, E. Making memristive neural network
36. Wong, H.-S. P. et al. Phase change memory. Proc. IEEE 98, 2201–2227 (2010). accelerators reliable. In Proc. The International Symposium on High
37. Burr, G. W. et al. Recent progress in phase-change memory technology. Performance Computer Architecture (HPCA) 52–65 (IEEE, 2018).
IEEE J. Emerg. Sel. Top. Circuits Syst. 6, 146–162 (2016). 69. Feinberg, B., Vengalam, U. K. R., Whitehair, N., Wang, S. & Ipek, E.
38. Khvalkovskiy, A. et al. Basic principles of STT-MRAM cell operation in Enabling scientific computing on memristive accelerators. In Proc.
memory arrays. J. Phys. D Appl. Phys. 46, 074001 (2013). International Symposium on Computer Architecture (ISCA) 367–382
39. Kent, A. D. & Worledge, D. C. A new spin on magnetic memories. (IEEE, 2018).
Nat. Nanotechnol. 10, 187 (2015). 70. Li, C. et al. Analogue signal and image processing with large memristor
40. Vourkas, I. & Sirakoulis, G. C. Emerging memristor-based logic circuit crossbars. Nat. Electron. 1, 52–59 (2018).
design approaches: A review. IEEE Circuits and Systems Magazine 16, 71. Le Gallo, M., Sebastian, A., Cherubini, G., Giefers, H. & Eleftheriou, E.
15–30 (2016). Compressed sensing with approximate message passing using in-memory
41. Borghetti, J. et al. Memristive switches enable stateful logic operations via computing. IEEE Trans. Electron Devices 65, 4304–4312 (2018).
material implication. Nature 464, 873 (2010). 72. Cai, F. et al. Harnessing intrinsic noise in memristor hopfield neural
42. Linn, E., Rosezin, R., Tappertzhofen, S., Böttger, U. & Waser, R. Beyond von networks for combinatorial optimization. Preprint at https://arxiv.org/
neumann-logic operations in passive crossbar arrays alongside memory abs/1903.11194 (2019).
operations. Nanotechnology 23, 305205 (2012). 73. Mostafa, H., Müller, L. K. & Indiveri, G. An event-based architecture for
43. Jeong, D. S., Kim, K. M., Kim, S., Choi, B. J. & Hwang, C. S. Memristors solving constraint satisfaction problems. Nat. Commun. 6, 8941 (2015).
for energy-efficient new computing paradigms. Adv. Electron. Mater. 2, 74. Parihar, A., Shukla, N., Jerry, M., Datta, S. & Raychowdhury, A. Vertex
1600090 (2016). coloring of graphs via phase dynamics of coupled oscillatory networks.
44. Kvatinsky, S. et al. MAGIC-memristor-aided logic IEEE Trans. Circuits Syst. Sci. Rep. 7, 911 (2017).
II Express Briefs 61, 895–899 (2014). 75. Kumar, S., Strachan, J. P. & Williams, R. S. Chaotic dynamics in nanoscale
45. Mahmoudi, H., Windbacher, T., Sverdlov, V. & Selberherr, S. Implication NbO 2 Mott memristors for analogue computing. Nature 548, 318 (2017).
logic gates using spin-transfer-torque-operated magnetic tunnel junctions 76. Torrejon, J. et al. Neuromorphic computing with nanoscale spintronic
for intrinsic logic-in-memory. Solid State Electron. 84, 191–197 (2013). oscillators. Nature 547, 428–431 (2017).
46. Kim, K. M. et al. Single-cell stateful logic using a dual-bit memristor. Phys. 77. Seo, J. et al. On-chip sparse learning acceleration with CMOS and resistive
Status Solidi Rapid Res. Lett. 13, 1800629 (2019). synaptic devices. IEEE Trans. Nanotechnol. 14, 969–979 (2015).
47. Xu, N., Fang, L., Kim, K. M. & Hwang, C. S. Time-efficient stateful dual- 78. Sheridan, P. M. et al. Sparse coding with memristor networks.
bit-memristor logic. Phys. Status Solidi Rapid Res. Lett. 13, 1900033 (2019). Nat. Nanotechnol. 12, 784–789 (2017).
48. Li, S. et al. Pinatubo: A processing-in-memory architecture for bulk bitwise 79. Sheridan, P. M., Du, C. & Lu, W. D. Feature extraction using memristor
operations in emerging non-volatile memories. In Proc. The Design networks. IEEE Trans. Neural Netw. Learn. Syst. 27, 2327–2336 (2016).
Automation Conference (DAC) 173 (ACM, 2016). 80. Choi, S., Sheridan, P. & Lu, W. D. Data clustering using memristor
49. Xie, L. et al. Scouting logic: A novel memristor-based logic design for networks. Sci. Rep. 5, 10492 (2015).
resistive computing. In Proc. The IEEE Symposium on VLSI (ISVLSI) 81. Karam, R., Puri, R., Ghosh, S. & Bhunia, S. Emerging trends in design and
176–181 (IEEE, 2017). applications of memory-based computing and content-addressable
50. Maan, A. K., Jayadevi, D. A. & James, A. P. A survey of memristive memories. Proc. IEEE 103, 1311–1330 (2015).
threshold logic circuits. IEEE Trans. Neural Netw. Learn. Syst. 28, 82. Rahimi, A. et al. High-dimensional computing as a nanoscalable paradigm.
1734–1746 (2016). IEEE Trans. Circuits Syst. I Regul. Pap. 64, 2508–2521 (2017).
51. Burr, G. W. et al. Neuromorphic computing using non-volatile memory. 83. Wu, T. F. et al. Hyperdimensional computing exploiting carbon nanotube
Adv Phys X 2, 89–124 (2017). FETs, resistive RAM, and their monolithic 3D integration. IEEE J.
52. Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching Solid-State Circuits 53, 3183–3196 (2018).
devices. Nat. Electron. 1, 333 (2018). 84. Graves, A. et al. Hybrid computing using a neural network with dynamic
53. Wang, Z. et al. Resistive switching materials for information processing. external memory. Nature 538, 471 (2016).
Nat. Rev. Mater. https://doi.org/10.1038/s41578-019-0159-3 (2020). 85. Ni, K. et al. Ferroelectric ternary content-addressable memory for one-shot
54. Wright, C. D., Hosseini, P. & Diosdado, J. A. V. Beyond von-neumann learning. Nat. Electron. 2, 521–529 (2019).
computing with nanoscale phase-change memory devices. Adv. Funct. 86. Eryilmaz, S. B. et al. Brain-like associative learning using a nanoscale
Mater. 23, 2248–2254 (2013). non-volatile phase change synaptic device array. Front. Neurosci. 8,
55. Sebastian, A. et al. Brain-inspired computing using phase-change memory 205 (2014).
devices. J. Appl. Phys. 124, 111101 (2018). 87. Hu, S. et al. Associative memory realized by a reconfigurable memristive
56. Godse, A. P. & Godse, D. A. Computer Organization and Architecture Hopfield neural network. Nat. Commun. 6, 7522 (2015).
(Technical Publications, 2008). 88. Kavehei, O. et al. An associative capacitive network based on nanoscale
57. Bojnordi, M. N. & Ipek, E. Memristive boltzmann machine: A hardware complementary resistive switches for memory-intensive computing.
accelerator for combinatorial optimization and deep learning. In Proc. The Nanoscale 5, 5119–5128 (2013).
International Symposium on High Performance Computer Architecture 89. Du, C. et al. Reservoir computing using dynamic memristors for temporal
(HPCA) 1–13 (IEEE, 2016). information processing. Nat. Commun. 8, 2204 (2017).
58. Shafiee, A. et al. ISAAC: A convolutional neural network accelerator with 90. Sebastian, A. et al. Temporal correlation detection using computational
in-situ analog arithmetic in crossbars. Comput. Archit. News 44, 14–26 (2016). phase-change memory. Nat. Commun. 8, 1115 (2017).

91. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015). 120. Esser, S. K., Appuswamy, R., Merolla,P., Arthur, J. V. & Modha, D. S.
92. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image Backpropagation for energy-efficient neuromorphic computing. In Proc.
recognition. In Proc. Conference on Computer Vision and Pattern Advances in Neural Information Processing Systems (Eds. Cortes, C. et al)
Recognition (CVPR) 770–778 (IEEE, 2016). 1117–1125 (NIPS, 2015).
93. LeCun, Y. Deep learning hardware: Past, present, and future. In Proc. 121. Lee, J. H., Delbruck, T. & Pfeiffer, M. Training deep spiking neural networks
International Solid-State Circuits Conference (ISSCC) 12–19 (IEEE, 2019). using backpropagation. Front. Neurosci. 10, 508 (2016).
94. Chen, Y., Yang, T., Emer, J. & Sze, V. Eyeriss v2: A flexible accelerator for 122. Woźniak, S., Pantazi, A. & Eleftheriou, E. Deep networks incorporating
emerging deep neural networks on mobile devices. IEEE J. Em. Sel. Top. C spiking neural dynamics. Preprint at https://arxiv.org/abs/1812.07040 (2018).
9, 292–308 (2019). 123. Benjamin, B. V. et al. Neurogrid: A mixed-analog-digital multichip system
95. Dazzi, M. et al. 5 parallel prism: A topology for pipelined implementations for large-scale neural simulations. Proc. IEEE 102, 699–716 (2014).
of convolutional neural networks using computational memory. In Proc. 124. Qiao, N. et al. A reconfigurable on-line learning spiking neuromorphic
NeurIPS MLSys Workshop (NeurIPS, 2019); http://learningsys.org/ processor comprising 256 neurons and 128k synapses. Front. Neurosci. 9,
neurips19/acceptedpapers.html 141 (2015).
96. Jia, Z., Maggioni, M., Smith, J. & Scarpazza, D. P. Dissecting the NVidia 125. Kuzum, D., Jeyasingh, R. G., Lee, B. & Wong, H.-S. P. Nanoelectronic
Turing T4 GPU via microbenchmarking. Preprint at https://arxiv.org/ programmable synapses based on phase change materials for brain-inspired
abs/1903.07486 (2019). computing. Nano Letters 12, 2179–2186 (2011).
97. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R. & Bengio, Y. 126. Kim, S. et al. NVM neuromorphic core with 64k-cell (256-by-256) phase
Quantized neural networks: Training neural networks with low precision change memory synaptic array with on-chip neuron circuits for continuous
weights and activations. J. Mach. Learn. Res. 18, 6869–6898 (2017). in-situ learning. In Proc. The International Electron Devices Meeting (IEDM)
98. Xue, C. et al. 24.1 a 1mb multibit ReRAM computing-in-memory macro 17–1 (IEEE, 2015).
with 14.6ns parallel MAC computing time for CNN based AI edge 127. Tuma, T., Le Gallo, M., Sebastian, A. & Eleftheriou, E. Detecting
processors. In Proc. The International Solid-State Circuits Conference correlations using phase-change neurons and synapses. IEEE Electr. Device
(ISSCC) 388–390 (IEEE, 2019). L. 37, 1238–1241 (2016).
99. Hu, M. et al. Memristor-based analog computation and neural network 128. Pantazi, A., Woźniak, S., Tuma, T. & Eleftheriou, E. All-memristive
classification with a dot product engine. Advanced Materials 30, neuromorphic computing with level-tuned neurons. Nanotechnology 27,
1705914 (2018). 355205 (2016).
100. Yao, P. et al. Fully hardware-implemented memristor convolutional neural 129. Covi, E. et al. Analog memristive synapse in spiking networks
network. Nature 577, 641–646 (2020). implementing unsupervised learning. Front. Neurosci. 10, 482 (2016).
101. Suri, M. et al. Phase change memory as synapse for ultra-dense 130. Serb, A. et al. Unsupervised learning in probabilistic neural networks with
neuromorphic systems: Application to complex visual pattern extraction. multi-state metal-oxide memristive synapses. Nat. Commun. 7, 12611 (2016).
In Proc. The International Electron Devices Meeting (IEDM) 4.4.1–44.4 131. Kheradpisheh, S. R., Ganjtabesh, M., Thorpe, S. J. & Masquelier, T.
(IEEE, 2011). STDP-based spiking deep convolutional neural networks for object
102. Chen, W.-H. et al. CMOS-integrated memristive non-volatile computing-in- recognition. Neural Networks 99, 56–67 (2018).
memory for AI edge processors. Nat. Electron. 2, 420–428 (2019). 132. Moraitis, T., Sebastian, A. & Eleftheriou, E. The role of short-term plasticity
103. Murray, A. F. & Edwards, P. J. Enhanced mlp performance and fault in neuromorphic learning: Learning from the timing of rate-varying events
tolerance resulting from synaptic weight noise during training. IEEE T. with fatiguing spike-timing-dependent plasticity. IEEE Nanotechnology
Neural Networ. 5, 792–802 (1994). Magazine 12, 45–53 (2018).
104. Liu, B. et al. Vortex: Variation-aware training for memristor X-bar. In 133. Wang, Z. et al. Memristors with diffusive dynamics as synaptic emulators
Proc. The Design Automation Conference (DAC) 1–6 (DAC, 2015). for neuromorphic computing. Nat. Mater. 16, 101 (2017).
105. Sebastian, A. et al. Computational memory-based inference and training of 134. Carboni, R. & Ielmini, D. Stochastic memory devices for security and
deep neural networks. In Proc. The Symposium on VLSI Technology computing. Adv. Electron. Mater. 1900198 (2019).
T168–T169 (IEEE, 2019). 135. Jo, S. H., Kim, K.-H. & Lu, W. Programmable resistance switching in
106. Gokmen, T., Onen, M. & Haensch, W. Training deep convolutional nanoscale two-terminal devices. Nano letters 9, 496–500 (2008).
neural networks with resistive cross-point devices. Front. Neurosci. 11, 136. Le Gallo, M., Athmanathan, A., Krebs, D. & Sebastian, A. Evidence for
538 (2017). thermally assisted threshold switching behavior in nanoscale phase-change
107. Alibart, F., Zamanidoost, E. & Strukov, D. B. Pattern classification by memory cells. J. Appl. Phys. 119, 025704 (2016).
memristive crossbar circuits using ex situ and in situ training. Nat. Commun. 137. Le Gallo, M., Tuma, T., Zipoli, F., Sebastian, A. & Eleftheriou, E. Inherent
4, 2072 (2013). stochasticity in phase-change memory devices. In Proc. 2016 46th European
108. Burr, G. W. et al. Experimental demonstration and tolerancing of a Solid-State Device Research Conference (ESSDERC) 373–376 (IEEE, 2016).
large-scale neural network (165 000 synapses) using phase-change 138. Alaghi, A. & Hayes, J. P. Survey of stochastic computing. ACM T Embed.
memory as the synaptic weight element. IEEE T. Electron Dev. 62, Comput. S. 12, 92 (2013).
3498–3507 (2015). 139. Gupta, S., Agrawal, A., Gopalakrishnan, K. & Narayanan, P. Deep learning
109. Gokmen, T. & Vlasov, Y. Acceleration of deep neural network training with limited numerical precision. In Proc. International Conference on
with resistive cross-point devices: design considerations. Front. Neurosci. 10, Machine Learning 1737–1746 (2015).
333 (2016). 140. Yang, K. et al. 16.3 a 23mb/s 23pj/b fully synthesized true-random-number
110. Agarwal, S. et al. Achieving ideal accuracies in analog neuromorphic generator in 28nm and 65nm CMOS. In Proc. Proceedings of the
computing using periodic carry. In Proc. The Symposium on VLSI International Solid-State Circuits Conference (ISSCC) 280–281 (IEEE, 2014).
Technology T174–T175 (IEEE, 2017). 141. Jiang, H. et al. A novel true random number generator based on a
111. Ambrogio, S. et al. Equivalent-accuracy accelerated neural-network training stochastic diffusive memristor. Nat. Commun. 8, 882 (2017).
using analogue memory. Nature 558, 60–67 (2018). 142. Gaba, S., Sheridan, P., Zhou, J., Choi, S. & Lu, W. Stochastic memristive
112. Yu, S. Neuro-inspired computing with emerging nonvolatile memory. devices for computing and neuromorphic applications. Nanoscale 5,
Proc. IEEE 106, 260–285 (2018). 5872–5878 (2013).
113. Prezioso, M. et al. Training and operation of an integrated neuromorphic 143. Balatti, S. et al. Physical unbiased generation of random numbers with coupled
network based on metal-oxide memristors. Nature 521, 61–64 (2015). resistive switching devices. IEEE T. Electron Dev. 63, 2029–2035 (2016).
114. Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 144. Choi, W. H. et al. A magnetic tunnel junction based true random number
15199 (2017). generator with conditional perturb and real-time output probability tracking.
115. Li, C. et al. Efficient and self-adaptive in-situ learning in multilayer In Proc. The International Electron Devices Meeting 12–5 (IEEE, 2014).
memristor neural networks. Nat. Commun. 9, 2385 (2018). 145. Carboni, R. et al. Random number generation by differential read of
116. Nandakumar, S. et al. Mixed-precision architecture based on computational stochastic switching in spin-transfer torque memory. IEEE Electr. Device L.
memory for training deep neural networks. In Proc. The International 39, 951–954 (2018).
Symposium on Circuits and Systems (ISCAS) 1–5 (IEEE, 2018). 146. Shim, Y., Chen, S., Sengupta, A. & Roy, K. Stochastic spin-orbit torque
117. Pfeiffer, M. & Pfeil, T. Deep learning with spiking neurons: Opportunities devices as elements for bayesian inference. Sci. Rep. 7, 14101 (2017).
and challenges. Front. Neurosci. 12, 774 (2018). 147. Tuma, T., Pantazi, A., Le Gallo, M., Sebastian, A. & Eleftheriou, E.
118. Diehl, P. U. et al. Fast-classifying, high-accuracy spiking deep networks Stochastic phase-change neurons. Nat. Nanotechnol. 11, 693–699 (2016).
through weight and threshold balancing. In Proc. International Joint 148. Mizrahi, A. et al. Neural-like computing with populations of
Conference on Neural Networks (IJCNN) 1–8 (IEEE, 2015). superparamagnetic basis functions. Nat. Commun. 9, 1533 (2018).
119. Sengupta, A., Ye, Y., Wang, R., Liu, C. & Roy, K. Going deeper in spiking 149. Bichler, O. et al. Visual pattern extraction using energy-efficient 2-PCM
neural networks: VGG and residual architectures. Front. Neurosci. 13, synapse neuromorphic architecture. IEEE T. Electron Dev. 59,
95 (2019). 2206–2214 (2012).

150. Holcomb, D. E., Burleson, W. P. & Fu, K. Power-up SRAM state as an 170. Yoon, K. J., Bae, W., Jeong, D.-K. & Hwang, C. S. Comprehensive writing
identifying fingerprint and source of true random numbers. IEEE T. margin analysis and its application to stacked one diode-one memory
Comput. 58, 1198–1210 (2009). device for high-density crossbar resistance switching random access
151. Gao, L., Chen, P.-Y., Liu, R. & Yu, S. Physical unclonable function memory. Adv. Electron. Mater. 2, 1600326 (2016).
exploiting sneak paths in resistive cross-point array. IEEE Transactions on 171. Le Gallo, M., Sebastian, A., Cherubini, G., Giefers, H. & Eleftheriou, E.
Electron Devices 63, 3109–3115 (2016). Compressed sensing recovery using computational memory.
152. Nili, H. et al. Hardware-intrinsic security primitives enabled by analogue In Proc. The International Electron Devices Meeting (IEDM) 28–3
state and nonlinear conductance variations in integrated memristors. (IEEE, 2017).
Nat. Electron. 1, 197 (2018). 172. van de Burgt, Y. et al. A non-volatile organic electrochemical device as a
153. Jiang, H. et al. A provable key destruction scheme based on memristive low-voltage artificial synapse for neuromorphic computing. Nat. Mater. 16,
crossbar arrays. Nat. Electron. 1, 548 (2018). 414 (2017).
154. Talati, N., Gupta, S., Mane, P. & Kvatinsky, S. Logic design within 173. Tang, J. et al. ECRAM as scalable synaptic cell for high-speed, low-power
memristive memories using memristor-aided logic (MAGIC). IEEE T. neuromorphic computing. In Proc. The International Electron Devices
Nanotechnol. 15, 635–650 (2016). Meeting (IEDM) 13–1 (IEEE, 2018).
155. Cheng, L. et al. Functional demonstration of a memristive arithmetic logic 174. Fuller, E. J. et al. Parallel programming of an ionic floating-gate
unit (MemALU) for in-memory computing. Adv. Funct. Mater. (2019). memory array for scalable neuromorphic computing. Science 364,
156. Haj-Ali, A., Ben-Hur, R., Wald, N., Ronen, R. & Kvatinsky, S. IMAGING: 570–574 (2019).
In-memory algorithms for image processing. IEEE T. Circuits Systems-I 65, 175. Kimura, H. et al. Complementary ferroelectric-capacitor logic for
4258–4271 (2018). low-power logic-in-memory VLSI. IEEE Journal of Solid-State Circuits 39,
157. Hamdioui, S. et al. Applications of computation-in-memory architectures 919–926 (2004).
based on memristive devices. In Proc. The Design, Automation & Test in 176. Aziz, A. et al. Computing with ferroelectric FETs: Devices, models, systems,
Europe Conference & Exhibition (DATE) 486–491 (IEEE, 2019). and applications. In Proc. The Design, Automation & Test in Europe
158. Xiong, F., Liao, A. D., Estrada, D. & Pop, E. Low-power switching of Conference & Exhibition (DATE) 1289–1298 (IEEE, 2018).
phase-change materials with carbon nanotube electrodes. Science 332, 177. Chanthbouala, A. et al. A ferroelectric memristor. Nat. Mater. 11,
568–570 (2011). 860 (2012).
159. Li, K.-S. et al. Utilizing sub-5 nm sidewall electrode technology for 178. Ríos, C. et al. Integrated all-photonic non-volatile multi-level memory.
atomic-scale resistive memory fabrication. In Proc. Symposium on VLSI Nat. Photon. 9, 725 (2015).
Technology 1–2 (IEEE, 2014). 179. Wuttig, M., Bhaskaran, H. & Taubner, T. Phase-change materials for
160. Salinga, M. et al. Monatomic phase change memory. Nat. Mater. 17, non-volatile photonic applications. Nat. Photon. 11, 465 (2017).
681–685 (2018). 180. Ríos, C. et al. In-memory computing on a photonic platform. Sci. Adv. 5,
161. Pi, S. et al. Memristor crossbar arrays with 6-nm half-pitch and 2-nm eaau5759 (2019).
critical dimension. Nat. Nanotechnol. 14, 35 (2019).
162. Brivio, S., Frascaroli, J. & Spiga, S. Role of Al doping in the filament Acknowledgements
disruption in HfO2 resistance switches. Nanotechnology 28, 395202 (2017). We would like to thank T. Tuma for technical discussions and assistance with scientific
163. Choi, S. et al. SiGe epitaxial memory for neuromorphic computing with illustrations, G. Sarwat and I. Boybat for critical review of the manuscript, and L. Rudin
reproducible high performance based on engineered dislocations. and N. Gustafsson for editorial help. A.S. acknowledges funding from the European
Nat. Mater. 17, 335 (2018). Research Council (ERC) under the European Union’s Horizon 2020 research and
164. Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. innovation programme (grant agreement number 682675).
Nat. Commun. 9, 2514 (2018).
165. Koelmans, W. W. et al. Projected phase-change memory devices.
Nat. Commun. 6, 8181 (2015). Competing interests
166. Giannopoulos, I. et al. 8-bit precision in-memory multiplication with The authors declare no competing interests.
projected phase-change memory. In Proc. The International Electron Devices
Meeting (IEDM) 27–7 (IEEE, 2018).
167. Chen, Y. et al. DaDianNao: A machine-learning supercomputer. In Proc. Additional information
The 47th Annual IEEE/ACM International Symposium on Microarchitecture Correspondence should be addressed to A.S.
609–622 (IEEE Computer Society, 2014). Peer review information Nature Nanotechnology thanks Cheol Seong Hwang and the
168. Ankit, A. et al. PUMA: A programmable ultra-efficient memristor-based other, anonymous, reviewers for their contribution to the peer review of this work.
accelerator for machine learning inference. In Proc. The International Reprints and permissions information is available at www.nature.com/reprints.
Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), 715–73 (ACM, 2019). Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
169. Eleftheriou, E. et al. Deep learning acceleration based on in-memory published maps and institutional affiliations.
computing. IBM Journal of Research and Development (2019). © Springer Nature Limited 2020
