(Survey) Memory Devices and Applications For In-Memory Computing
Traditional von Neumann computing systems involve separate processing and memory units. However, data movement is costly in terms of time and energy, and this problem is aggravated by the recent explosive growth in highly data-centric applications related to artificial intelligence. This calls for a radical departure from the traditional systems, and one such non-von Neumann computational approach is in-memory computing, whereby certain computational tasks are performed in place in the memory itself by exploiting the physical attributes of the memory devices. Both charge-based and resistance-based memory devices are being explored for in-memory computing. In this Review, we provide a broad overview of the key computational primitives enabled by these memory devices as well as their applications spanning scientific computing, signal processing, optimization, machine learning, deep learning and stochastic computing.
Today's computing systems are primarily built based on the von Neumann architecture, where data must be moved to a processing unit. During the execution of various computational tasks, large amounts of data need to be shuttled back and forth between the processing and memory units, and this incurs significant costs in latency and energy. The latency associated with accessing data from the memory units is a key performance bottleneck for a range of applications, in particular for the increasingly prominent artificial intelligence (AI) related workloads. There is an increasing disparity between the speed of the memory and processing units, typically referred to as the memory wall1. The energy cost of moving data is another significant challenge given that computing systems are severely power limited due to cooling constraints as well as the proliferation of mobile computing devices. Even at the relatively old 45 nm complementary metal oxide semiconductor (CMOS) node, the cost of multiplying two numbers is orders of magnitude lower than that of accessing them from memory2. The current approaches, such as the use of hundreds of processors in parallel (for example, graphics processing units3) or application-specific processors4,5 that are custom designed for specific applications, are not likely to fully overcome the challenge of data movement. Hence, it is becoming increasingly evident that novel architectures need to be explored where memory and processing are better collocated. One prominent idea that dates to the 1990s is that of physically placing monolithic compute units closer to a monolithic memory6. This concept, known as near-memory computing, has benefitted significantly from recent advances in die stacking technology7 and the commercialization of advanced memory modules such as the hybrid memory cube (HMC)8 and high bandwidth memory (HBM)9. To achieve a denser and more fine-grained connectivity between memory and processing units, even three-dimensional (3D) monolithic integration has been proposed10. However, in all of these approaches that aim to reduce the time and distance to memory access, there still exists a physical separation between the memory and the compute units.

In-memory computing is an alternate approach where certain computational tasks are performed in place in the memory itself, organized as a computational memory unit. As schematically illustrated in Fig. 1, this is achieved by exploiting in tandem the physical attributes of the memory devices, their array-level organization, the peripheral circuitry as well as the control logic. Any computational task that is realized within the confines of a computational memory unit could be referred to as in-memory computing. However, the key distinction is that at no point during computation is the memory content read back and processed at the granularity of a single memory element. This latter scenario, where in addition the processing is performed in close proximity to the memory array, could instead be viewed as near-memory computing. Besides alleviating the costs in latency and energy associated with data movement, in-memory computing also has the potential to significantly improve the computational time complexity associated with certain computational tasks. This arises mostly from the massive parallelism afforded by a dense array of millions of memory devices performing computation. It is also likely that by introducing physical coupling between the memory devices, we can further reduce the computational time complexity11. By blurring the boundary between processing and memory units (an attribute that is also shared with the highly energy-efficient mammalian brain, where memory and processing are deeply intertwined12), we gain significant improvements in computational efficiency. However, this is at the expense of the generality afforded by the conventional approach, where memory and processing units are functionally distinct from each other. In this Review, we first give an overview of the memory devices that facilitate in-memory computing as well as the key in-memory computational primitives that are enabled. Subsequently, we present a range of applications that exploit these primitives. Finally, we present an outlook on the opportunities and challenges.

Memory devices
Memory is at the heart of in-memory computing. One of the primary means to store information to date is through the presence or absence of charge, such as in dynamic random access memory (DRAM), static random access memory (SRAM) and flash memory13. There is also an emerging class of memory devices where information is stored in terms of differences in the atomic arrangements or orientation of ferromagnetic metal layers. Such differences manifest as a change of resistance and these devices are thus termed resistive memory devices14. Sometimes they are also referred to as memristive devices due to their relation to the circuit theoretic concept of memristive systems15.

One of the primary characteristics of a memory device is the access time, that is, how fast information can be stored (written) and retrieved (read). Another key characteristic is cycling endurance, which refers to the number of times a memory device can be switched from one state to the other.
Fig. 1 | In-memory computing. a, In a conventional computing system, when an operation f is performed on data D, D has to be moved into a processing
unit, leading to significant costs in latency and energy. b, In the case of in-memory computing, f(D) is performed within a computational memory unit by
exploiting the physical attributes of the memory devices, thus obviating the need to move D to the processing unit. The computational tasks are performed
within the confines of the memory array and its peripheral circuitry, albeit without deciphering the content of the individual memory elements. Both
charge-based memory technologies, such as SRAM, DRAM and flash memory, and resistance-based memory technologies, such as RRAM, PCM and
STT-MRAM, can serve as elements of such a computational memory unit.
The memory devices in a computational memory unit are usually organized in a two-dimensional (2D) array with horizontal and vertical wires, typically referred to as the word line (WL) and the bit line (BL), used to access them. The memory array in a computational memory unit can be quite similar to that in a conventional memory unit but with certain differences in the read/write circuitry, the format of the input/output data as well as the control logic. For example, depending on the applications, multiple WLs need to be activated in parallel or analogue output currents along BLs need to be sensed precisely.

Charge-based memory. An SRAM cell is a bi-stable transistor structure typically made of two CMOS inverters connected back to back, as shown in Fig. 2a. The output potential of one inverter is applied as input to the other, forming a feedback loop that freezes the cell in a given logical state (0 or 1). Two additional field-effect transistors (FETs) serve as selectors, yielding a standard 6 transistor (6T) SRAM cell. SRAM is built entirely from FETs and has no dedicated storage element. However, one can view the charge as being confined within the barriers formed by the FET channels and the gate insulators. Due to the low FET barrier height (0.5 eV), however, the charge constantly needs to be replenished from an external source and hence SRAM always needs to be connected to a power supply. A DRAM cell consists of a capacitor placed in series with a FET (Fig. 2b). The charge is confined within the capacitor insulator, which forms a fixed-height barrier, and the FET. Since the maximum height of the FET barrier is limited by the band-gap of silicon (≈1.1 eV), the charge can be retained only for a fraction of a second and this necessitates periodic refresh. As shown in Fig. 2c, in a flash memory cell, the charge storage node is coupled to the gate of a FET, with charge stored either on a conductive electrode surrounded by insulators (floating gate) or in discrete traps within a defective insulator layer (charge trapping layer). Unlike in DRAM, the barrier height of the storage node is sufficiently high for long-term data retention. However, the write operation requires high voltages (typically >10 V) and entails significant latency (>10 µs) due to the need to overcome the storage node barriers. Depending on how the flash memory cells are organized, they are referred to as NOR or NAND Flash. In NOR Flash, every memory cell is connected to a BL, while in NAND Flash, several memory cells connected in series share a single connection to the BL. A flash memory cell stores fewer electrons than DRAM and SRAM. Flash memory also has a substantially lower cycling endurance due to the gate oxide degradation under strong electric fields.

A range of in-memory logic and arithmetic operations can be performed using both SRAM and DRAM. Capacitive charge redistribution serves as the foundation for many of them, in particular storing and sharing of charge across multiple storage nodes. In DRAM, simultaneous reading of devices along multiple rows can be used to execute basic Boolean functions within the memory array16,17.
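To make the charge-sharing principle concrete, here is a minimal behavioural sketch in Python (not a circuit-level model): the three simultaneously activated DRAM cells are represented by stored bits, the bit-line voltage is taken as their average, and a sense amplifier thresholds it against an assumed midpoint reference of 0.5 VDD. All names and values are illustrative.

```python
import numpy as np

VDD = 1.0          # nominal cell voltage for a stored '1' (illustrative)
V_REF = 0.5 * VDD  # sense-amplifier decision threshold (assumed midpoint)

def dram_charge_share(bit_a, bit_b, bit_sel):
    """Behavioural model of triple-row activation in a DRAM array (cf. Fig. 2d).

    The three capacitors (cells A, B and SEL) share their charge on the bit
    line, so the bit-line voltage settles at the average of the three stored
    voltages.  A sense amplifier then thresholds it against V_REF.
    """
    v_bl = VDD * (bit_a + bit_b + bit_sel) / 3.0
    return int(v_bl > V_REF)

# SEL = 0 realizes AND (both operands must be '1' to exceed the threshold),
# SEL = 1 realizes OR (a single '1' operand is already enough).
for a in (0, 1):
    for b in (0, 1):
        assert dram_charge_share(a, b, 0) == (a & b)
        assert dram_charge_share(a, b, 1) == (a | b)
print("charge-sharing AND/OR truth tables reproduced")
```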
[Fig. 2a–c schematics. Approximate stored charge, access time and cycling endurance per technology: SRAM, ~10^3 electrons, access time <1 ns, endurance >10^16; DRAM, ~10^5 electrons, access time <10 ns, endurance >10^16; flash, ~100 electrons, read access time <100 ns, endurance >10^5.]
[Fig. 2d–g schematics. In d, simultaneous activation of the three WLs yields VBL = (VA + VB + VSEL)/3, which is compared against VREF to produce the AND/OR result.]
Fig. 2 | Charge-based memory devices and computational primitives. a, A 6T SRAM cell consists of two CMOS inverters connected back to back. The
charge is confined within the barriers formed by FET channels and by gate insulators. The stored charge retention is small and an external source constantly
replenishes the lost charge. SRAM has almost unlimited cycling endurance and sub-nanosecond read and write access times. b, A DRAM cell comprises
a capacitor (C) that serves as the storage node, which is connected in series to a FET. c, The storage node of a flash memory cell is coupled to the gate
of a FET. d, Schematic illustration of bit-wise logical operations performed using three DRAM cells. The operands are stored in cells A and B. AND or OR
operations are performed by simultaneously activating the three WLs corresponding to the cells. The logical state of cell SEL is used to dictate whether an
AND or an OR operation is performed, with logical one and zero corresponding to OR and AND operations, respectively. The BL voltage corresponds to
the average voltage across the three capacitors and is sensed using a sense amplifier with a decision threshold voltage of VREF. e, Bit-wise logical operations
using an SRAM array. The BL and BL̄ are pre-charged to the supply voltage, VDD, prior to the execution of the operation. After deactivation of the pre-charge signal, both the WLs are activated so that both BL and BL̄ are discharged at different rates that depend on the data stored in the bit-cells. When the two activated SRAM cells in a column are both 1 (0), VBL (VBL̄) will be comparable to VDD, whereas for the other bit combinations, both VBL and VBL̄ will be lower than VDD. Hence, by sensing VBL and VBL̄ with a SA, AND and NOR operations are performed, respectively. f, Schematic illustration of performing an MVM operation using an array of SRAM cells and capacitors. The SRAM cells are used to store the elements of the binary matrix. In the first step, the inputs are provided per row, charging the capacitors on that row to a value proportional to the input. In step two, the capacitors that are associated with the SRAM elements storing 0s are discharged. Finally, in step three, the capacitors are shorted along the columns, performing a charge sharing operation so that the final voltage on the capacitors corresponds to the analogue MVM result. g, Illustration of an MVM operation performed using Flash memory devices. The current IDS is a function of the cell's threshold voltage Vt as well as the drain–source voltage VDS and the gate–source voltage VGS. By fixing VDS, Kirchhoff's current law can be employed to perform MVM between a matrix, stored in terms of Vt, and a binary input vector that is used to modulate VGS.
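The three-step capacitor scheme of Fig. 2f can be emulated numerically as below. This is a sketch under idealized assumptions (identical capacitors, lossless charge sharing, arbitrary array size and scaling); it only illustrates why the final shared-charge voltage is proportional to a binary matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

def capacitive_mvm(A_binary, x):
    """Charge-domain MVM following the three steps of Fig. 2f.

    Step 1: each row capacitor is charged to a voltage proportional to the
            input element of that row.
    Step 2: capacitors whose SRAM cell stores 0 are discharged, i.e. the row
            voltage is gated by the stored bit.
    Step 3: the capacitors of a column are shorted together, so the final
            column voltage is the average of the surviving charges, which is
            proportional to A^T x for a binary matrix A.
    """
    n_rows, n_cols = A_binary.shape
    v_cap = np.tile(x[:, None], (1, n_cols))   # step 1: charge per row
    v_cap = v_cap * A_binary                   # step 2: discharge the zeros
    return v_cap.mean(axis=0)                  # step 3: charge sharing per column

A = rng.integers(0, 2, size=(64, 8))           # binary matrix stored in SRAM
x = rng.uniform(0.0, 1.0, size=64)             # analogue inputs (e.g. from a DAC)
v_out = capacitive_mvm(A, x)
reference = A.T @ x / A.shape[0]               # ideal result up to the 1/N scale factor
print(np.allclose(v_out, reference))           # True: charge sharing equals the scaled MVM
```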
Figure 2d shows a basic cell configuration that can be used to implement bit-wise AND/OR functions. Two memory cells, A and B, are used to store the operands. The logic state of the third cell, SEL, is set to 0 or 1 depending on whether an AND or an OR operation is realized, respectively. When all three cells are activated simultaneously, the bit-line voltage corresponds to the average voltage across the three capacitors. This voltage is sensed using a sense amplifier (SA) with a single decision threshold, which outputs the result of the logical operation. By using the negated output of the SA to also implement the NOT operation, a functionally complete set of Boolean functions is obtained. These bit-wise operations can be performed along the entire row of memory devices, thus enabling parallel bulk bit-wise operations. Unlike DRAM, the SRAM cells do not contain a built-in capacitor and hence the parasitic BL capacitance is used instead to enable bulk in-memory logical operations18,19. In Fig. 2e, a basic construct for performing in-place bit-wise logical operations using SRAM is shown. Here, again, both of the WLs are activated simultaneously and, by sensing the BL and BL̄ with an SA, AND and NOR operations are performed, respectively. Besides realizing the logical primitives, it is also essential to efficiently cascade such operations. To perform cascadable logic operations using both DRAM and SRAM, additional cloning or duplication steps need to be enabled, allowing the construction of in-memory full adders and multipliers17,20. The overhead of having to serially execute the cascaded operations is overcome by the ability to process several bit lines in parallel.

SRAM arrays can also be used for matrix-vector multiplication (MVM) operations, Ax = b, where A is the data matrix, x is the input vector, and b is the output vector21–23. If the elements of A and x are limited to signed binary values, the multiply operation is simplified to a combination of XNOR and ADD functions. Here, a 12T SRAM cell can be designed to execute XNOR operations within every memory cell21. In cases where x is non-binary, one approach is to employ capacitors in addition to the SRAM cells22–24. It was recently shown how 6-bit inputs can be multiplied with binary matrices stored in SRAM22. This involves a three-step process that is illustrated in Fig. 2f. Note that the additional capacitors and switches could be shared among a group of SRAM cells at the expense of reduced parallelism and hence operational bandwidth. It is also possible to build the analogue capacitor-based circuits in the vicinity of the SRAM array to accelerate MVM via near-memory computing25,26.

Flash memory can also be used to perform MVM operations27,28. The gate voltage is modulated in accordance with a binary input vector (see Fig. 2g). The matrix elements are stored as charge on the floating gate28. Because the devices can be accessed in parallel along a BL, NOR Flash has generally been preferred over NAND Flash for in-memory computing. However, there is recent work describing the use of 3D NAND, consisting of vertically stacked layers of serially connected flash devices, whereby each layer of the array encodes a unique matrix29. This approach could help to overcome the scalability issue of NOR Flash, which is difficult to scale beyond the 28 nm technology node.

Resistance-based memory. Memristive devices can be programmed to be in a low resistance state (LRS) or a high resistance state (HRS) through the application of electrical SET and RESET pulses, respectively. There is also the possibility to achieve intermediate resistance levels in certain types of memristive devices. The devices are typically organized in a 2D array and require a selection device in series with each device to prevent parasitic sneak path currents during writing and reading30.

Resistive random access memory (RRAM) devices comprise metal–insulator–metal (MIM) stacks (Fig. 3a) and the resistive switching process typically involves the creation and disruption of conductive filaments (CF) comprising a localized concentration of defects. An LRS state corresponds to CFs bridging the two metal layers. Even though the history of RRAM can be traced back to at least the 1960s31, key technological demonstrations in the 2000s32–34 gave significant impetus to this technology. Phase change memory (PCM), which also dates back to the 1960s35, is based on the property of certain types of materials, such as Ge2Sb2Te5, to undergo a Joule heating-induced, rapid and reversible transition from a highly resistive amorphous phase to a highly conductive crystalline phase36,37. As shown in Fig. 3b, a typical PCM device has a mushroom shape where the bottom electrode confines heat and current. This results in a near-hemispherical shape of the amorphous region in the HRS state. By crystallizing the amorphous region, the LRS state is obtained. A relative newcomer to the resistive memory family, magnetoresistive random access memory (MRAM) consists of a magnetic tunnel junction (MTJ) structure with two ferromagnetic metal layers (pinned and free). These layers, for example made of the CoFeB alloy, are separated by a thin tunnel oxide such as MgO (Fig. 3c). In the pinned layer, the magnetic polarization is structurally fixed to act as a reference, whereas in the free layer it is free to change during the write operation. Voltage pulses of opposite polarity are applied to switch the polarization of the free layer. Depending on whether the two ferromagnetic polarizations are parallel or antiparallel, the LRS and HRS states are obtained due to the tunnel magnetoresistive effect. Spin transfer torque MRAM (STT-MRAM) is currently the most promising MRAM technology38,39. RRAM and PCM operate based on the rearrangement of atomic configurations and hence have worse access times (write speed) and cycling endurance than MRAM. However, they have substantially larger resistance windows that enable the storage of intermediate resistances even at an array level. RRAM has the advantage of using materials that are common in semiconductor manufacturing. However, in spite of the simplicity of the device concept, a comprehensive understanding of the switching mechanism is still lacking compared to PCM and MRAM.

One of the attributes of memristive devices that can be exploited for computation is their non-volatile binary storage capability. Logical operations are enabled through the interaction between the voltage and resistance state variables40. One particularly interesting characteristic of certain memristive logic families is statefulness, where the Boolean variable is represented solely in terms of the resistance states41–43. A schematic illustration of one such stateful memristive logic, MAGIC, that realizes the NOR logic operation is shown in Fig. 3d44. Both the operands and the result are stored in terms of the resistance state variable. Stateful logic can be realized almost entirely in the memory array and has been demonstrated for RRAM41 and STT-MRAM45. Stateful logic is also cascadable, whereby the output from one logical gate can directly feed into the input of a second logic gate46,47. However, in stateful logic, the devices repeatedly get written into during the execution of the logical operations, which is a key drawback due to the associated energy cost and the limited cycling endurance of the devices. Hence, there is renewed interest in non-stateful logic such as the one shown in Fig. 3e. Here, the logical operands are stored as resistance values, but the result of the logical operation is computed as a voltage signal48,49. The operands stay fixed in the memory array and the devices need not be programmed during the evaluation of the logical operation. However, the sequential cascading of these logical operations requires additional circuits, typically located outside of the memory array. Memristive threshold logic is yet another non-stateful logic family where both the inputs and outputs are voltage signals and the logical functions are defined using the resistance values50.

The non-volatile storage capability, in particular, the ability to store a continuum of conductance values, facilitates the key computational primitive of analogue MVM51–53. The physical laws that are exploited to perform this operation are Ohm's law and Kirchhoff's current summation laws (Fig. 3f).
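The stateful MAGIC NOR gate discussed above (and detailed in the Fig. 3d caption below) can be captured behaviourally by tracking only the resistance states. The switching condition in the following sketch is an abstraction of the voltage-divider argument, not a device-accurate model, and the numeric coding of LRS/HRS is an arbitrary choice.

```python
LRS, HRS = 1, 0   # logic 1 = low resistance state, logic 0 = high resistance state

def magic_nor(op_a, op_b):
    """Behavioural model of the stateful MAGIC NOR operation.

    The result device is first initialized to logic 1 (LRS).  A pulse larger
    than twice V_RESET is then applied across the two operand devices, which
    sit in parallel and in series with the result device.  If at least one
    operand is in the LRS, enough of the pulse drops across the result device
    to RESET it to the HRS; if both operands are in the HRS, the result device
    keeps its LRS state.
    """
    result = LRS                      # initialization step
    if op_a == LRS or op_b == LRS:    # sufficient voltage reaches the result device
        result = HRS                  # result device is RESET
    return result

for a in (0, 1):
    for b in (0, 1):
        assert magic_nor(a, b) == int(not (a or b))
print("stateful NOR truth table verified")
```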
Fig. 3 | Resistance-based memory devices and computational primitives. a, An RRAM device in the LRS where the CF comprises a large concentration
of defects for example oxygen vacancies in metal oxides or metallic ions injected from the electrodes. By the application of appropriate voltage pulses,
the defects can be migrated back to the top electrode thus disconnecting the CF and achieving a HRS. b, A mushroom-type PCM device in the HRS
state where the amorphous phase blocks the bottom electrode. To create this state, a RESET pulse is applied that can melt a significant portion of the
phase change material. When the pulse is stopped abruptly, the molten material quenches into the amorphous phase due to glass transition. When a
current pulse of lesser amplitude is applied to the PCM device in the HRS state, a part of the amorphous region crystallizes. By fully crystallizing the
phase change material, the LRS state is obtained. c, An STT-MRAM device with two ferromagnetic layers (pinned and free) separated by a tunnel oxide
layer. The magnetic polarization of the free layer can be changed upon writing. Depending on whether the ferromagnetic polarizations are parallel or
antiparallel, the device assumes a low or high resistance, respectively. The transition to the parallel state takes place directly through conduction electrons,
which are previously spin-polarized by the pinned layer. Subsequently, the magnetic polarization of the free layer is rotated using magnetic momentum
conservation. To switch to the antiparallel state, an opposite voltage, and hence current direction, is employed. d, Schematic illustration of a stateful
NOR logic operation using 3 bipolar memristive devices44. Two devices represent the operands and one represents the result. First, the result device is
initialized to logic 1 (LRS). Subsequently, a voltage pulse with an amplitude larger than twice that of VRESET is applied simultaneously to both the operand
devices. If either operand device is at logic 1 (LRS), then at least half of the voltage drops across the result device and the latter switches to logic 0 (HRS).
Note that, due to the bipolar switching behaviour, the operand devices remain unchanged as long as VSET > 2VRESET. When both the operand devices are at logic 0 (HRS), the voltage dropped across the result device is not sufficient to switch it to logic 0. Hence it remains at logic 1. Thus, this simple circuit implements a NOR operation where all the logic state variables are represented purely in terms of resistance values. e, Non-stateful AND and OR operations using 2 memristive devices and a variable-threshold SA. By simultaneously activating multiple rows, and with the appropriate choice of current
thresholds, it is possible to implement logical operations such as AND and OR. f, To perform the operation Ax = b, the elements of A are mapped linearly
to the conductance values of memristive devices organized in a crossbar configuration. The x values are mapped linearly to the amplitudes or durations
of read voltages and are applied to the crossbar along the rows. The result of the computation, b, will be proportional to the resulting current measured
along the columns of the array. Note that, if the inputs are mapped onto durations, the result b will be proportional to the total charge (for example,
current integrated over a certain fixed period of time). It is also possible to perform an MVM operation with the transpose of A using the same cross-bar
configuration by applying the input voltage to the column lines and measuring the resulting current along the rows. The negative elements of x are typically
applied as negative voltages whereas the negative elements of A are coded on separate devices together with a subtraction circuit.
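The mapping described in Fig. 3f can be emulated numerically as follows. The conductance range, read voltage and noise level are assumed values chosen only to illustrate how Ohm's law and Kirchhoff's current law realize Ax = b, with negative matrix elements handled by a differential pair of devices and a subtraction step.

```python
import numpy as np

rng = np.random.default_rng(1)
G_MAX = 100e-6   # assumed maximum device conductance (100 uS)
V_READ = 0.2     # assumed read-voltage amplitude (V)
SIGMA_G = 1e-6   # assumed conductance programming/read noise (S)

def crossbar_mvm(A, x):
    """Analogue computation of b = A x on a crossbar.

    The device at row i, column j stores A[j, i], so that the Kirchhoff current
    sum along column j equals sum_i x[i] * A[j, i] = (A x)[j].  Negative matrix
    elements are placed on a second set of devices and subtracted afterwards.
    """
    W = A.T
    scale_a = np.abs(W).max()
    scale_x = np.abs(x).max()
    G_pos = np.clip(+W, 0, None) / scale_a * G_MAX + rng.normal(0, SIGMA_G, W.shape)
    G_neg = np.clip(-W, 0, None) / scale_a * G_MAX + rng.normal(0, SIGMA_G, W.shape)
    v_in = x / scale_x * V_READ                 # inputs mapped to read voltages
    i_cols = v_in @ (G_pos - G_neg)             # Ohm's law per device, Kirchhoff sum per column
    return i_cols * scale_a * scale_x / (G_MAX * V_READ)   # undo the linear mappings

A = rng.normal(size=(8, 8))
x = rng.normal(size=8)
print(crossbar_mvm(A, x))   # approximates A @ x, with device-noise errors
print(A @ x)
```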
Memristive devices also exhibit an accumulative behaviour52,54,55, whereby the conductance of devices such as PCM and RRAM progressively increases or decreases with the successive application of appropriate programming pulses. This non-volatile accumulative behaviour, in spite of its nonlinear and stochastic nature, can be exploited in several applications, such as training deep neural networks, where the conductance values need to be incrementally modified.

Applications
The computational primitives reviewed in the Memory devices section have been applied to a wide range of application domains, ranging from scientific computing that requires high precision, to stochastic computing that is enabled by imprecision and randomness. A high-level overview of the main applications that are being researched for in-memory computing is shown in Fig. 4. In-memory computing can be applied both to reduce the computational complexity of a problem as well as to reduce the amount of data being accessed by performing computations inside the memory arrays. The problems that could benefit the most from the complexity reduction are the NP-hard problems involving combinatorial optimization. Data-centric applications in machine learning and scientific computing benefit the most from reduced memory access. In this section, we review how in-memory computing has been applied to those applications and discuss the challenges involved with respect to the device properties presented previously.
[Fig. 4 chart. Applications plotted against computational complexity and data accesses include combinatorial optimization, deep learning training, reservoir computing, spiking neural networks, solving linear and partial differential equations, random number generation, compressed sensing, principal component analysis, deep learning inference, sparse coding, unclonable functions, associative memory, and image filtering and compression.]
Fig. 4 | The application landscape for in-memory computing. The applications are grouped into three main categories based on the overall degree of
computational precision that is required. A qualitative measure of the computational complexity and data accesses involved in the different applications
is also shown.
Scientific computing. Linear algebra computational kernels, such as MVM, are common not only to machine learning but also to scientific computing applications. However, both memristive and charge-based memory devices suffer from significant inter-device variability and inhomogeneity across an array. Moreover, they exhibit intra-device variability and randomness that is intrinsic to how they operate. Hence, the precision of analogue MVM operations with these devices is rather low. Although approximate solutions are sufficient for many computational tasks in the domain of AI, building an in-memory computing unit that can effectively address scientific computing and data analytics problems, which typically require high numerical accuracy, remains challenging.

The aforementioned accuracy limitation can, to a certain extent, be remedied by an old technique in computer architecture called 'bit slicing'. Bit slicing is a general approach for constructing a processor from modules of smaller bit width. Each of the modules processes one bit field or 'slice' of an operand56. The grouped processing components will then have the capability to process, in parallel, an arbitrarily chosen full word-length of a particular task. This concept has been proposed for increasing the accuracy of the in-place MVM based on in-memory computing (Fig. 5a)57–60. According to this technique, an n-bit element of the matrix is mapped onto device conductance values of n binary crossbar arrays, that is, n bit slices. Thus, each bit slice contains the binary values of the matrix elements in a particular bit position. Similarly, bit slicing can also be applied to the input vector elements, where each bit slice is input to the crossbar arrays one at a time. To perform an in-place MVM, a vector bit slice is multiplied with a matrix bit slice, with O(1) time complexity, and the partial products of these operations are combined outside of the crossbar arrays through a shift-and-add reduction network57. Note that the bit slices can also be implemented on the same crossbar array in a column-by-column manner. In this case, columns at a distance n from each other represent a single bit slice. Although the above concept has been described based on bit slices, that is, binary memristive arrays, it can easily be generalized to multi-level memristive devices. The bit slice approach applied to a 16-bit input vector sliced into 16 1-bit slices for increasing numerical precision has been demonstrated experimentally, where a numerical differential equation solver using a small Ta2O5–x RRAM 16x3 crossbar array was successfully implemented61.

Although the bit slice technique appears to address the limitations surrounding the precision of analogue MVM operations, there are still inaccuracies arising from the analogue summation along columns, which potentially could be more detrimental in larger crossbar arrays. Moreover, the extra peripheral circuitry of the shift-and-add external reduction networks could substantially increase the energy consumption and area. Mixed-precision computing is an alternate approach to achieve high precision processing based on in-memory computing. This approach is based on the well-established iterative refinement technique for improving a computed solution to a system of linear equations62. Through this technique, the time complexity of iterative linear solvers can be reduced by combining low-precision with high-precision arithmetic63. The adaptation of this concept for in-memory computing and experimental demonstration of solving a system of 5,000 linear equations using 998,752 PCM devices with arbitrarily high accuracy was presented in ref. 64. Here, the idea is to use fast but imprecise MVM, via in-memory computing in an iterative linear solver, to obtain an approximate solution, and then refine this solution based on the residual error calculated precisely through digital computing (Fig. 5b). The main limitation of this technique is that the data need to be stored both in crossbar arrays as well as in the memory of a high-precision digital processing unit, which increases the resources needed to solve the problem. Moreover, the achievable speedup comes from reducing the number of iterations needed to solve the problem, resulting in an overall computational complexity of O(N²) for an NxN matrix, that is, still proportional to the problem size.
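The mixed-precision scheme of Fig. 5b can be prototyped as follows. The noisy_mvm routine stands in for the analogue crossbar (its 5% noise level is an assumption), while the residual computation and the solution update are kept in 64-bit floating point, mirroring the split between the computational memory unit and the high-precision digital unit.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_mvm(A, v, rel_noise=0.05):
    """Stand-in for the imprecise analogue MVM performed in computational memory."""
    exact = A @ v
    noise = rel_noise * np.linalg.norm(exact) / np.sqrt(len(exact))
    return exact + noise * rng.normal(size=len(exact))

def mixed_precision_solve(A, b, n_outer=30, n_inner=10, tol=1e-9):
    """Iterative refinement: inexact inner solves, exact residuals (cf. Fig. 5b)."""
    x = np.zeros_like(b)
    for _ in range(n_outer):
        r = b - A @ x                        # residual in high precision (digital unit)
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = np.zeros_like(b)
        for _ in range(n_inner):             # low-precision inner solve of A z = r,
            z = z + (r - noisy_mvm(A, z))    # touching A only via the noisy in-memory MVM
        x = x + z                            # refine the solution in high precision
    return x

n = 50
A = np.eye(n) + 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)   # well-conditioned test system
x_true = rng.normal(size=n)
b = A @ x_true
x = mixed_precision_solve(A, b)
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))   # orders of magnitude below the 5% MVM error
```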
Fig. 5 | Increasing the precision of in-memory computing for scientific computing. a, Implementation of the bit slicing concept in a crossbar array for an
inner product operation. The 3-bit data vector is sliced into three 1-bit vectors stored on three separate columns of the crossbar array. The 2-bit input
vector is sliced into two 1-bit vectors sequentially applied to the crossbar array as voltages. The outputs of the crossbar from the first input bit slice go
through an analogue to digital conversion and appropriate shifting prior to accumulation and storage in a local buffer as a partial inner product result. The
second input bit slice undergoes the same process, producing the second partial inner product result. These two partial inner product results are added up,
yielding the final result of the in-place inner-product vector operation. b, The concept of mixed-precision in-memory computing used to iteratively improve
the computed solution to a system of linear equations based on inaccurate MVM operations performed via analogue in-memory computing.
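Following the Fig. 5a description, the sketch below slices a small data vector into 1-bit columns and an input vector into 1-bit pulses, and recombines the binary partial products with shifts and adds. The word lengths match the 3-bit/2-bit case of the figure, but the specific vector values are arbitrary.

```python
import numpy as np

def bit_slices(values, n_bits):
    """Return an (n_values, n_bits) array of bits, most significant bit first."""
    return np.array([[(v >> (n_bits - 1 - b)) & 1 for b in range(n_bits)]
                     for v in values])

def bit_sliced_inner_product(data, inputs, data_bits=3, input_bits=2):
    """Inner product computed only from binary-binary partial products.

    Each data bit slice plays the role of one crossbar column of binary devices;
    each input bit slice is applied as one set of binary pulses.  The analogue
    column sums are emulated by integer dot products and recombined through the
    shift-and-add reduction network.
    """
    d_slices = bit_slices(data, data_bits)      # columns of the crossbar
    x_slices = bit_slices(inputs, input_bits)   # sequentially applied input slices
    total = 0
    for xb in range(input_bits):                # one input slice at a time
        for db in range(data_bits):             # one matrix slice per column
            partial = int(d_slices[:, db] @ x_slices[:, xb])   # binary column sum
            shift = (data_bits - 1 - db) + (input_bits - 1 - xb)
            total += partial << shift           # shift-and-add reduction
    return total

data = [0, 3, 6, 2]     # arbitrary 3-bit data values
inputs = [1, 2, 3, 1]   # arbitrary 2-bit input values
print(bit_sliced_inner_product(data, inputs), int(np.dot(data, inputs)))  # both equal
```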
Several extensions to these two techniques are imaginable to further improve the performance benefits and reliability. One way to potentially speed up linear solvers further is to realize a one-step linear solver in the analogue domain65, which has been demonstrated using a 3x3 RRAM crossbar array66. This approach is based on an old idea of analogue matrix inversion67, whereby a known vector, forced as currents on the columns of the crossbar, establishes an output voltage vector at the rows, which is equal to the product of the inverse of the conductance matrix multiplied by the vector of currents. Although the high parallelism provided by this approach is promising, its implementation is hardwired and therefore not scalable, and requires very precise conductance tuning and high linearity of current–voltage characteristics. There are also initial results on error correction schemes68 as well as extensions to the bit-slicing concept for achieving floating-point accuracy69 on memristive crossbar arrays. These research avenues could enlarge the application space of in-memory computing to encompass applications in scientific computing where high computational accuracy is required.

Signal processing, optimization and machine learning. There are several applications in the domain of signal processing, optimization and machine learning where approximate solutions can be considered acceptable, and the bulk of the computation could thus be performed with in-memory computing. The crossbar-based analogue MVM can be used in many applications such as image compression, compressed sensing, combinatorial optimization, sparse coding, principal component analysis, associative memories and reservoir computing.

The application of in-memory computing to analogue image compression has been studied experimentally in ref. 70. The idea is to encode a transform matrix, for example, a discrete cosine transform, as the conductance values of devices organized in a crossbar array. The image pixel intensities, represented as voltages, are applied to the crossbar first row by row and, in a second step, column by column. The compression is then performed by keeping only a certain ratio of the highest coefficients of the transformed image and discarding the rest. Compression experiments using a 128x64 crossbar array of hafnium oxide (HfO2) devices yielded reasonably well-reconstructed images, although with a few visible artefacts due to device non-idealities70. The transform coding described above for sparsifying large signals is fundamental to common compression schemes such as JPEG or MPEG, but can also be used for compressed sensing. The basic idea of compressed sensing is to acquire a few (M) sampling measurements from a high-dimensional signal of size N, and to subsequently recover that signal accurately. Compressed sensing can be realized via in-memory computing by encoding the MxN measurement matrix used for this process, which typically contains randomly distributed elements, in a crossbar array of memory devices65,71. This array can be used to perform the MVM operations associated with both the compression and recovery tasks. The efficacy of this scheme has been experimentally demonstrated through 128x128 image compression and reconstruction tasks using more than 256,000 PCM devices71. However, here as well, device non-idealities such as conductance noise were found to reduce the reconstruction accuracy.

In the field of optimization, a promising application of in-memory computing is for combinatorial optimization problems, such as the travelling salesman problem, Boolean satisfiability and integer linear programming. Combinatorial optimization is the process of searching for maxima or minima of an objective function whose domain is a discrete but large configuration space. To address these computationally intensive, typically NP-hard problems, simulated annealing inspired approaches, such as the massively parallel Boltzmann machines and Hopfield networks, have been proposed.
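A compact way to see the transform-coding compression described above in action is the following sketch: a DCT matrix (standing in for the conductance-encoded transform) is applied to the image rows and then columns, and only the largest coefficients are kept. The test image, the keep-ratio and the absence of device noise are simplifying assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; in hardware its entries would be mapped to conductances."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] = 1.0 / np.sqrt(n)
    return D

def compress(image, keep_ratio=0.1):
    """Crossbar-style compression: transform rows, then columns, then threshold."""
    D = dct_matrix(image.shape[0])
    coeffs = D @ image @ D.T                    # two passes of MVMs (rows, then columns)
    flat = np.abs(coeffs).ravel()
    cutoff = np.sort(flat)[int((1 - keep_ratio) * flat.size)]
    coeffs[np.abs(coeffs) < cutoff] = 0.0       # discard the small coefficients
    return coeffs

def decompress(coeffs):
    D = dct_matrix(coeffs.shape[0])
    return D.T @ coeffs @ D                     # inverse transform, again plain MVMs

rng = np.random.default_rng(3)
image = rng.uniform(size=(64, 64)).cumsum(axis=0).cumsum(axis=1)   # smooth test image
recon = decompress(compress(image, keep_ratio=0.1))
print(np.linalg.norm(recon - image) / np.linalg.norm(image))       # small reconstruction error
```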
The basic idea is to compute the inner products, the fundamental building blocks in Boltzmann machines57 or Hopfield networks72, in place via in-memory computing. For solving the problem, the network is run until convergence, that is, the energy is minimized, which involves updating only the state variables, while the weights implemented in the crossbar array remain constant. An interesting prospect is to utilize the device noise as an explicit source of noise to force the network to continuously explore the solution space, which is necessary to achieve proper convergence72,73. However, it is required to precisely control this noise via an annealing schedule, which is challenging to implement. Another intriguing approach, going beyond simply accelerating the inner products in recurrent networks, is to use a network of coupled nonlinear analogue oscillators whose dynamics execute an efficient search for solutions of combinatorial optimization problems73. Volatile memristive devices based on Mott insulator–metal transition materials, such as VO2 (ref. 74) and NbO2 (ref. 75), as well as spintronic oscillators based on MTJs (ref. 76), can be used to realize compact nanoscale oscillators that facilitate this form of computing.

Several memory-centric problems in machine learning could also benefit from in-memory computing. One is sparse dictionary learning, a learning framework in which a sparse representation of input data is obtained in the form of a linear combination of basic elements, which form the so-called dictionary of features. As opposed to the transform coding approach described earlier, both the dictionary and the sparse representation are learned from the input data. If the learned dictionary is mapped onto device conductance values in a crossbar array, it is possible to obtain the sparse representation using the iterative-shrinking threshold77 or locally competitive algorithms78. The matrix-vector and the transpose-matrix-vector multiplications associated with the algorithms are performed in the crossbar. Dictionary learning requires updating the conductance values by exploiting the accumulative behaviour of the memristive devices, based on, for example, stochastic gradient descent77,79, which is challenging due to device stochasticity and nonlinear conductance change with the number of applied pulses79. Another application is principal component analysis, a dimensionality reduction technique to reveal the internal structure of data by using a limited number of principal components. It is usually achieved by finding the eigenvectors of the data covariance matrix. This can be realized using the 'power iteration' method, in which the MVM operations can be performed using in-memory computing65. An alternative approach is to use a linear feedforward neural network in which the weights are implemented in a crossbar array. The network is optimized via unsupervised learning using Sanger's rule to obtain the principal components, given by the weights connected to each output neuron representing the classes in which the data is clustered80.

Another relevant application for in-memory computing, which is used in several machine learning algorithms, is associative memory. An associative memory compares input search data with the data stored in it and finds the address of the data with the closest match to the input data81. This capability is used in several learning frameworks, such as brain-inspired hyperdimensional computing82,83 and memory-augmented neural networks84,85. One way to realize associative memory is to use a Hopfield network, which can be trained to minimize the energy of the states that it should remember. This has been successfully demonstrated on small arrays of PCM86 and RRAM87 devices. Another more straightforward way to realize associative memory is simply to encode the stored data directly in a crossbar array and compute, in parallel, the Hamming distances of each stored data vector with the input search data vector via in-memory dot-products88.

Finally, the collective dynamics of an ensemble of dynamical systems could be exploited to perform certain machine learning tasks. One prominent example of this is reservoir computing (RC). The essential idea of reservoir computing is to map inputs into a high-dimensional space such that it is possible to classify the input patterns with a simple linear classifier. One of the approaches to implement RC is to feed the input into a fixed, physically realized dynamical system. Memristive devices could play a key role in these types of physical RC. For example, Du et al. proposed the use of a collection of memristive devices with short-term temporal dynamics to serve as the physical reservoir and to classify temporal signals89. Sebastian et al. used a reservoir of a million PCM devices and exploited their accumulative behaviour to classify binary random processes into correlated and uncorrelated classes90.

Deep learning. Recently, deep artificial neural networks, loosely inspired by biological neural networks, have shown a remarkable human-like performance in tasks such as image processing and voice recognition91. A deep neural network (DNN) consists of at least two layers of nonlinear neuron units interconnected by adjustable synaptic weights. Modern DNNs can have over 1,000 layers92. By tuning the adjustable weights, for instance, optimizing them by using millions of labelled examples, these networks can solve certain problems remarkably well. Dedicated mixed-signal chips that could implement multi-layer networks were already developed in the early 1990s but were eventually abandoned in favour of field-programmable gate arrays (FPGAs) and general-purpose graphics processing units (GPGPUs), partly due to lack of flexibility93. While high-performance GPGPUs are incontestably the hardware that has been primarily responsible for the recent success of deep learning, mixed-signal architectures based on in-memory computing are being actively researched, targeting mostly edge computing applications where high energy efficiency is critical.

A DNN can be mapped onto multiple crossbar arrays of memory devices that communicate with each other, as illustrated in Fig. 6a. A layer of the DNN can be implemented on (at least) one crossbar, in which the weights Wij of that layer are stored in the charge or conductance state of the memory devices at the crosspoints. The propagation of data through that layer is performed in a single step by inputting the data to the crossbar rows and deciphering the results at the columns. The results are then passed through the neuron nonlinear function and input to the next layer. The neuron nonlinear function is typically implemented at the crossbar periphery, using analogue or digital circuits. Because every layer of the network is stored physically on different arrays, each array needs to communicate at least with the array(s) storing the next layer for feed-forward networks, such as multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs). For recurrent neural networks (RNNs), the output of an array needs to communicate with its input. Array-to-array communication can be realized using a flexible on-chip network, akin to those used in digital DNN accelerators94. However, their efficient adaptation to in-memory computing based architectures is still being explored95.

The efficient MVM realized via in-memory computing is very attractive for inference-only applications, where data is propagated through the network on offline-trained weights. With respect to specialized inference accelerators operating at reduced digital precision (4 to 8-bit), such as Google's tensor processing unit4 and low-power GPGPUs such as the NVIDIA T496, in-memory computing aims to improve the energy efficiency even further by eliminating the separation between memory and processing for the MVM operations. Implementations using SRAM-based in-memory computing have focused on binary weight networks, in which weights are represented by a single bit97. Various implementations, such as current-based21 and charge-based22,23 computational circuits, have been proposed and were able to demonstrate 1-bit arithmetic energy efficiencies of >100 tera operations per second per watt (TOPS W−1) for MVM. Chips using in-memory computing on non-volatile memory devices have also been fabricated using NOR-Flash28 and RRAM98–100.
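The in-memory associative search mentioned above can be emulated with a single matrix-vector product: for ±1-coded vectors the dot product equals N minus twice the Hamming distance, so the stored row with the largest dot product (one crossbar read in hardware) is the closest match. The encoding and array sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def hamming_search(stored_bits, query_bits):
    """Index of the stored word closest (in Hamming distance) to the query.

    With bits mapped to {-1, +1}, dot(stored, query) = N - 2 * hamming_distance,
    so maximizing the dot product (one in-memory MVM) minimizes the distance.
    """
    S = 2.0 * stored_bits - 1.0          # conductance-coded stored patterns
    q = 2.0 * query_bits - 1.0           # query applied as read voltages
    return int(np.argmax(S @ q))

n_words, n_bits = 32, 128
stored = rng.integers(0, 2, size=(n_words, n_bits))
query = stored[7].copy()
flip = rng.choice(n_bits, size=5, replace=False)
query[flip] ^= 1                         # corrupt 5 of the 128 bits
print(hamming_search(stored, query))     # recovers index 7 with high probability
```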
Fig. 6 | Deep learning training and inference using in-memory computing. a, Implementation of a feed-forward DNN on multiple crossbar arrays of
memory devices. The synaptic weights Wij are stored in the conductance or charge state of the memory devices. Each layer of the network is implemented
in a different crossbar. Forward propagation of data through the network is performed by applying, for each layer, input data on the crossbar rows, and
deciphering the results at the column level. The results are then passed through a nonlinear function implemented at the periphery and input to the next
layer. A global communication network is used to send data from one array to another. b, A first possible implementation of the three steps performed
in training a layer of a neural network in a crossbar array. Forward and backward propagations are implemented by inputting activations xi and errors δj
on the rows and columns, respectively. An in-place weight update can be performed by sending pulses based on the values of xi and δj from the rows and
columns simultaneously. This implements an approximate outer product and programs the devices at the same time. c, A second possible implementation,
whereby the weight update ∆Wij is computed in the digital domain and applied via programming pulses to the corresponding devices.
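The three training steps of Fig. 6b,c can be mimicked for a single linear layer as below. The quantized, rounded conductance update models (very loosely) the finite granularity of the devices, and the learning rate, sizes and granularity are illustrative assumptions rather than values from the cited experiments.

```python
import numpy as np

rng = np.random.default_rng(5)

DELTA_G = 1e-3    # smallest programmable weight (conductance) change, i.e. device granularity
ETA = 0.1         # learning rate (illustrative)

def train_step(W, x, target):
    """One forward/backward/update cycle on a crossbar-mapped linear layer."""
    y = x @ W                                  # 1: forward pass (inputs on rows, sums on columns)
    delta = y - target                         # error gradient at the layer output
    grad_x = delta @ W.T                       # 2: backward pass (transpose MVM, inputs on columns)
    dW = -ETA * np.outer(x, delta)             # 3: outer product of activations and error gradients
    W = W + DELTA_G * np.round(dW / DELTA_G)   # devices only move in discrete conductance steps
    return W, grad_x

n_in, n_out = 8, 4
W = 0.1 * rng.normal(size=(n_in, n_out))       # weights stored in the crossbar
W_true = rng.normal(size=(n_in, n_out))        # random linear map used as the training target
for _ in range(2000):
    x = rng.normal(size=n_in)
    W, _ = train_step(W, x, x @ W_true)
print(np.abs(W - W_true).max())                # small residual, floored by the device granularity
```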
Using non-volatile memory ensures that the weights will be retained when the power supply is turned off, unlike with SRAM. Also, the multi-level storage capability of these devices can be exploited to implement non-binary networks, which yield higher accuracy and are easier to train than binary weight networks. Usually, at least two devices per weight are used in a differential configuration to implement positive and negative weights101. Multiple binary/multi-level devices using the bit-slicing technique can be used to further increase the precision58,59,98,102. The state-of-the-art experimental demonstrations of DNN inference based on in-memory computing have reported a competitive energy efficiency of ≥10 TOPS W−1 for reduced-precision MVM (Table 1). Nonetheless, for all these implementations, custom training103–105 and/or on-chip retraining25,100 of the network is needed to mitigate the effect of defects, and of device and circuit level non-idealities, on the network accuracy. The training procedure should be generic and as agnostic as possible to the hardware, such that the network would have to be trained only once to be deployed on a multitude of different chips. Another important research topic is the design of efficient intra- and inter-layer pipelines58 to ensure that all the arrays on the chip are always active during inference, together with flexible array-to-array communication and control. This is especially important for CNNs, in which a large image is passed through small kernels at only a few pixels at a time106, leading to prohibitive latencies and buffer requirements if no pipelining is used.

In-memory computing can also be used in the context of supervised training of DNNs with backpropagation. This training involves three stages: forward propagation of labelled data through the network, backward propagation of the error gradients from output to the input of the network, and weight update based on the computed gradients with respect to the weights of each layer. This procedure is repeated over a large dataset of labelled examples for multiple epochs until satisfactory performance is reached by the network. This makes the training of state-of-the-art networks very time and energy-consuming even with high-performance GPGPUs.
Table 1 | State-of-the-art chip-level experimental demonstrations of neural network inference based on in-memory computing
Device: SRAM | SRAM | SRAM | NOR-Flash | RRAM | RRAM
CMOS technology: 65 nm | 65 nm | 65 nm | 180 nm | 130 nm | 55 nm
Array size: 16 kb | 16 kb | 2.4 Mb | 100 kb | 16 kb | 1 Mb
Weight/activation precision: 1 bit/6 bit | 1 bit/ternary | 1 bit/1 bit | Analogue/analogue | Analogue/8 bit | 3 bit/2 bit
Network: LeNet-5 CNN | MLP/CNN | 5/9-layer CNN | 2-layer MLP | 5-layer CNN | CNN
Dataset: MNIST | MNIST/CIFAR-10 | MNIST/CIFAR-10 | MNIST | MNIST | CIFAR-10
Accuracy: 98.3% | 98.3%/85.7% | 98.6%/83.3% | 94.7% | 96.2% | 88.52%
Peak MAC efficiency*: 40.3 TOPS W−1 | 139 TOPS W−1 | 658 TOPS W−1 | 10 TOPS W−1 | 11 TOPS W−1 | 21.9 TOPS W−1
Reference: 22 | 21 | 23 | 28 | 100 | 98
*1 multiply-and-accumulate (MAC) = 2 operations (OPs). MNIST, Modified National Institute of Standards and Technology database; CIFAR, Canadian Institute For Advanced Research.
When performing training of a neural network encoded in crossbar arrays, forward propagation is performed in the same way as for the inference described above. The only difference is that all the activations xi of each layer have to be stored locally in the periphery. Next, backward propagation is performed by inputting the error gradient δj from the subsequent layer onto the columns of the current layer and deciphering the result from the rows. The resulting weighted sum Σj δjWij needs to be multiplied by the derivative of the neuron nonlinear function, which is computed externally, to obtain the error gradient of the current layer. Finally, the weight update is performed based on the outer product of activations and error gradients xiδj of each layer. One approach is to perform a parallel weight update by sending deterministic or stochastic overlapping pulses from the rows and columns simultaneously to implement an approximate outer product and program the devices at the same time (Fig. 6b)107–111. While this parallelism may be efficient in terms of speed, each outer product needs to be applied to the arrays one at a time (either after every training example or one by one after a batch of examples), leading to a large number of pulses applied to the devices. This results in stringent requirements on the device granularity, asymmetry and linearity to obtain accurate training109,112, and high device endurance is critical. Using multiple devices per synapse with a periodic carry can relax some of the device requirements, at the price of a costly reprogramming of the entire array every time the carry is performed110,111. Another approach is a mixed analogue/digital weight update whereby ∆Wij is computed digitally and applied to the arrays row-by-row or column-by-column (Fig. 6c). ∆Wij can be applied either at every individual training example (online training) or batch of training examples (by accumulating all the updates within one batch in a digital memory)113–115. ∆Wij can also be accumulated in a digital memory across batches and specific devices are programmed when their corresponding accumulated values reach a threshold116. This approach is more flexible than the parallel weight update based on overlapping pulses because it can implement any learning rule, not only stochastic gradient descent, and the digital computation and accumulation of weight updates significantly relax the requirements on the device granularity and endurance116. However, the cost is the need for additional digital computing and memory hardware. The training approaches presented here are still at the stage of functionality demonstration and need to overcome the device-related challenges before they could be employed on edge devices in applications where online learning is desirable.

A third application domain for in-memory computing in deep learning is spiking neural networks (SNNs). The main difference between SNNs and the non-spiking neural networks discussed so far is that SNN neurons compute with asynchronous spikes that are temporally precise, as opposed to continuous-valued activations that operate on a common clock cycle. Hence, SNNs are ideally suited for processing spatio-temporal event-driven information from neuromorphic sensors. There has been significant progress in recent years in designing deep SNNs trained with supervised learning that can perform close to conventional DNNs117. The main approaches rely either on converting weights from a previously trained non-spiking DNN118,119, or implementing backpropagation training using spike signals on the SNN itself120,121. Recently it has been shown that a spiking neuron can be transformed into a recurrent neural network unit, and thus it is possible to apply the existing deep learning frameworks for seamless training of any SNN architecture with backpropagation through time122. However, most of the efforts in applying in-memory computing to SNNs have focused on unsupervised learning with local learning rules. The best-known example for this is spike-timing-dependent plasticity (STDP), which adjusts a synaptic weight based on the relative timing between its output and input neuron spikes. In-memory implementations of SNNs have traditionally been done using slow subthreshold analogue CMOS circuits that directly emulate the functions of neurons and synapses, together with fast event-driven digital communication12,123. Support for STDP learning was also successfully implemented124. Non-volatile nanoscale devices, such as PCM125–128 and RRAM129,130, have been proposed to be integrated as part of the synapse and neuron circuits in a hardware SNN. Support for STDP learning with these devices has been generally implemented using rather complex schemes based on overlapping pulses. However, STDP-based learning rules have still not been able to reach the accuracy of conventional DNNs trained with backpropagation, despite significant recent progress131. Although SNNs are believed to be computationally more powerful than conventional DNNs because of the added temporal dimension, an application where this advantage is clearly demonstrated and exploited is still lacking. This is one of the reasons why generally SNNs have not been as widely adopted as conventional DNNs. However, with the incorporation of additional bio-inspired neuronal and synaptic dynamics132, SNNs could transcend conventional deep learning in certain application domains and memristive devices could be exploited to natively implement such dynamics133.

Stochastic computing and security. The stochasticity associated with the switching behaviour in memristive devices can also be exploited for in-memory computing134. In an MRAM, the MTJ switching is inherently stochastic due to the thermal fluctuations affecting the free layer, and the write voltage and duration can be used to tune the switching probability. In RRAM, if the write voltage is comparable to VSET, then the SET transition takes place after a certain time delay. This delay time exhibits significant cycle-to-cycle statistical variations135. This behaviour is also observed in PCM
Fig. 7 | Stochasticity associated with memristive devices and applications in computing. Resistance switching in memristive devices is intrinsically
stochastic, with an ability to control the stochasticity via the voltage and duration of write pulses. a, Schematic illustration of a circuit that exploits
memristive stochasticity for the generation of true random numbers141. The device is connected in series with a resistor in a voltage divider configuration.
A write pulse of a certain fixed duration is applied to the device. A SET transition in the device after a stochastic delay time will cause the comparator to
output a 1. The difference between the pulse duration and the delay time is measured by a counter in units of a fixed clock period. Based on whether this
time is an even or odd multiple of the clock period, a 0 or 1 bit is assigned. By applying a sequence of write pulses, a stochastic bit stream is generated.
b, A Bayesian network is shown where each node represents random variables and each link describes the direct dependence among them, quantified
in terms of the transitional conditional probabilities. Such networks can be used to estimate the probability of hidden causes from a given observation.
The required probability distributions to perform such probabilistic inference can be generated efficiently using stochastically switching memristive
devices. For example, the probabilities can be encoded within Poisson distributed binary bit streams generated using MRAM devices146. The associated
computations such as the intersection operation can be implemented by multiplying the two bit streams with an AND gate. c, The stochasticity associated
with the SET process in PCM can be used to realize stochastically firing neurons. The key computational element is the neuronal membrane, which stores
the membrane potential in the phase configuration of a PCM device. These devices enable the emulation of large and dense populations of neurons
for bioinspired signal representation and computation. d, Memristive crossbar arrays can be used to generate physically unclonable functions (PUF).
The broad distribution of resistance values as well as the current sneak paths are exploited to obtain a large set of challenge-response pairs (CRP). For
example, in an NxN crossbar PUF depicted here, the challenge consists of an N-bit vector applied to the N rows. The current from the N columns is then
read and converted to an N-bit response. The theoretical number of CRPs is 2^N.
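The use of stochastic switching for probabilistic inference sketched in Fig. 7b can be mimicked in software: a stochastically switching device is modelled as a Bernoulli bit stream whose mean encodes a probability, and an AND gate over two independent streams estimates the product of the probabilities. The stream length and the probability values are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def stochastic_bitstream(p, n_bits):
    """Bit stream from a device whose switching probability has been tuned to p."""
    return (rng.uniform(size=n_bits) < p).astype(np.uint8)

def and_streams(stream_a, stream_b):
    """Bit-wise AND of two independent streams: its mean estimates p_a * p_b."""
    return stream_a & stream_b

n = 100_000
p_rain, p_sprinkler = 0.2, 0.3
rain = stochastic_bitstream(p_rain, n)
sprinkler = stochastic_bitstream(p_sprinkler, n)
joint = and_streams(rain, sprinkler)
print(joint.mean(), p_rain * p_sprinkler)   # ~0.06: the AND gate multiplies probabilities
```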
would be a prohibitive loss of computational precision. Another important challenge is the finite resistance of the crossbar wires. It can lead to parasitic voltage drops on the devices during readout when a high current is flowing through them (referred to as the IR drop), creating errors in the analogue computation results. This not only limits the maximum crossbar size that can be reliably operated, but also the integration density, because of the difficulty of using the metal layers close to the CMOS front-end due to their higher resistivity. From an architectural point of view, a computational memory unit could have multiple in-memory computing cores