Energy-Efficient Memcapacitor Devices For Neuromorphic Computing
https://doi.org/10.1038/s41928-021-00649-y
Data-intensive computing operations, such as training neural networks, are essential for applications in artificial intelligence
but are energy intensive. One solution is to develop specialized hardware onto which neural networks can be directly mapped,
and arrays of memristive devices can, for example, be trained to enable parallel multiply–accumulate operations. Here we
show that memcapacitive devices that exploit the principle of charge shielding can offer a highly energy-efficient approach for
implementing parallel multiply–accumulate operations. We fabricate a crossbar array of 156 microscale memcapacitor devices
and use it to train a neural network that could distinguish the letters ‘M’, ‘P’ and ‘I’. Modelling these arrays suggests that
this approach could offer an energy efficiency of 29,600 tera-operations per second per watt, while ensuring high precision
(6–8 bits). Simulations also show that the devices could potentially be scaled down to a lateral size of around 45 nm.
Brain-inspired computing, often termed neuromorphic computing, based on artificial neural networks and their hardware implementations could be used to solve a broad range of computationally intensive tasks. Neuromorphic computing can be traced back to the 1980s (refs. 1,2), but the field gained considerable momentum after the development of memristive devices3 and the proposal of convolutional layers in deep neural networks at the algorithmic level4,5. Since then, several resistive neuromorphic systems and devices have been implemented using oxide materials6–8, phase-change memory9, spintronic devices10,11 and ferroelectric devices (tunnel junctions12,13 and ferroelectric field-effect transistors (FeFETs)14,15), and such systems, namely ferroelectric tunnel junctions13 and SONOS (that is, silicon–oxide–nitride–oxide–silicon) transistors16, have exhibited energy efficiencies of up to 100 tera-operations per second per watt (TOPS W–1). All these approaches rely on the analogue storage of synaptic weights, which can be used in multiplication operations, and use Kirchhoff's current law for the summation of currents implemented via crossbar arrays17.
Memcapacitive devices18 are similar to memristive devices but are based on a capacitive principle, and could potentially offer a lower static power consumption than memristive devices. There have been theoretical proposals for memcapacitor devices18–22, but few practical implementations23–26. Memcapacitor devices can be realized through the implementation of a variable plate distance concept, as demonstrated in micro-electromechanical systems27, a metal-to-insulator transition material in series with a dielectric layer22, changing the oxygen vacancy front in a classical memristor20, and a simple metal–oxide–semiconductor capacitor with a memory effect24,25. To obtain a high dynamic range, these devices either have a large parasitic resistive component20 at small plate distances or limited lateral scalability due to large plate distances. Similar problems occur with memcapacitors having varying surface areas23 or varying dielectric constants26.
In this Article, we report memcapacitor devices based on charge shielding that can offer high dynamic range and low power operation. We fabricate devices on the scale of tens of micrometres and use them to create a crossbar array architecture that we use to run an image recognition algorithm. We also assess the potential scalability of our devices for use in large-scale energy-efficient neuromorphic systems using simulations.
Memcapacitive device based on charge shielding
Our memcapacitive device consists of a top gate electrode, a shielding layer with contacts and a back-side readout electrode (Fig. 1a). These layers are separated by dielectric layers. The top dielectric layer can have a memory effect, for example, charge trapping or ferroelectric, which may influence the shielding layer, or the shielding layer itself can exhibit a memory effect (in this paper, only the first principle is investigated). A very high on/off ratio of electric field coupling and therefore the capacitance between the gate electrode and readout electrode can be obtained with either total shielding or transmission. The lateral scalability is substantially better compared with the previously mentioned concepts, since the thickness of each layer can be readily optimized, while the dynamic ratio is mainly dependent on the shielding efficiency of the shielding layer.
Generally, charge screening depends on the Debye screening length LD:
$$L_D = \sqrt{\frac{\varepsilon_0 \varepsilon_r U_T}{n e}}, \qquad (1)$$
where UT is the thermal voltage, n is the charge carrier concentration, ε0 is the electric field constant, εr is the relative electric field constant and e is the elementary charge. The electric field drops exponentially within the shielding layer and drops to 37% within the screening length LD under the condition ψ ≪ UT. In practice, in semiconductors, the relationship is highly nonlinear depending on potential ψ at depth x, as follows:
$$\frac{d^2\psi}{dx^2} = \frac{-e}{\varepsilon_0 \varepsilon_r}\left(p_0\left[\exp\left(\frac{-\psi}{U_T}\right) - 1\right] - n_0\left[\exp\left(\frac{\psi}{U_T}\right) - 1\right]\right), \qquad (2)$$
where p0 and n0 are the charge carrier concentrations of holes and electrons in thermal equilibrium, respectively. Therefore, the Debye
1Max Planck Institute of Microstructure Physics, Halle (Saale), Germany. 2SEMRON GmbH, Dresden, Germany. ✉e-mail: [email protected]; [email protected]; [email protected]
[Figure 1 image, panels a–c. Labels: gate, memory dielectric with memory charges, spacer, shielding layer (SL) with p+, n– and n+ regions, dielectric, passive dielectric, readout electrode (I, Q), n+ poly, Ti2Si, metal plugs; crossbar with WL, SLs and BLs.]
Fig. 1 | Structure of the memcapacitor device. a, General device structure with a gate electrode, shielding layer (SL) and readout electrode (I, current;
Q, charge). The electric field coupling is indicated by the blue arrow. b, Device structure with a lateral pin junction as well as electron and hole injection.
c, Crossbar arrangement of the device in b, where a.c. input signals are applied to the word lines (WLs) and the accumulated charge is read out at the bit
lines (BLs). During readout, the SL is mostly connected to GND.
screening length (equation (1)), given the exponential spatial dependence of the field in the material, is only a linear approximation of nonlinear differential equation (2). Especially for strong inversion and accumulation within the shielding layer, the length scales of screening become much smaller than the Debye length. This nonlinearity with respect to the applied gate voltage or charge stored in the memory dielectric leads to either strong shielding or fairly good transmission.
A more detailed device structure is shown in Fig. 1b with lateral p+n–n+ junctions in the shielding layer. The p+- and n+-doped regions act as reservoirs for holes and electrons, respectively, and can inject each carrier type for the purposes of shielding. This enables additional device functionality; however, more importantly, it also allows a symmetric device response for positive and negative gate voltages. This is a crucial feature for neuromorphic devices, because the weight update is then undistorted and the training accuracy is thus higher17. During readout, the shielding layer is connected to the ground (GND). During writing and training, the voltages applied to the p+ and n+ contacts can differ and can also act as a selector, as explained in Supplementary Section 1. As shown in Fig. 1c, the single device can be arranged into a crossbar for highly parallel multiply–accumulate (MAC) operations. In this case, the gate electrode becomes the word line (WL), where input signals are applied, and the shielding layer becomes a shielding line (SL) running perpendicular to the WL. The readout electrode functions as the bit line (BL), which is parallel to the SL, and the accumulated charge out of one BL is the calculated result of accumulated multiplications at each crossing point. The multiplication is conducted between the input signal of the WL and the state of the shielding layer, which, in turn, is adjusted by the memory material. The weights are encoded in the capacitance of each crossing point. In contrast to resistive devices, capacitive devices respond only to dynamic voltage or current signals; therefore, an alternating current (a.c.) voltage is applied to the WL during readout. Writing of the memory material is achieved by a voltage difference between the SL and WL.
CV curves and gradual programming of single devices
Single devices on the micrometre scale were fabricated on a silicon-on-insulator wafer, where the handle wafer, containing a highly n-doped epitaxial layer, acts as the readout electrode and the buried oxide acts as the bottom dielectric layer. As a memory principle, ferroelectric-assisted charge trapping (polarization charge attracts carriers and thus promotes trapping) was used to combine the advantages of both principles28,29, and the tunnelling oxide was 2.5 nm thick to avoid charge detrapping. Details of the fabrication can be found in Methods.
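The crossbar readout described before this section can be illustrated with a minimal sketch. It assumes an idealized linear model in which each crossing point contributes a charge C_ij · V_i to its bit line; parasitics and the nonlinear CV characteristic are ignored. The function name `bitline_charges` and the example capacitances and voltages are illustrative assumptions, not values from this work.

```python
def bitline_charges(capacitances, wl_voltages):
    """Accumulate charge per bit line: Q_j = sum_i C[i][j] * V[i].

    capacitances: rows are word lines, columns are bit lines (farads).
    wl_voltages:  a.c. amplitude applied to each word line (volts).
    Idealized model (assumption): linear coupling, no parasitics.
    """
    n_bl = len(capacitances[0])
    charges = [0.0] * n_bl
    for row, v in zip(capacitances, wl_voltages):
        for j, c in enumerate(row):
            charges[j] += c * v
    return charges


# Hypothetical 3 x 2 array: transmitting cells ~6 aF, shielded cells ~0.1 aF.
C = [[6e-18, 0.1e-18],
     [0.1e-18, 6e-18],
     [6e-18, 6e-18]]
V = [0.5, 0.5, 0.0]  # third word line is inactive
Q = bitline_charges(C, V)  # one accumulated charge per bit line
```

Each bit line sums the products of input amplitude and stored capacitance along its column, which is the multiply–accumulate operation performed in a single readout step.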
[Figure 2 image, panels a–g. Axes: capacitance C (pF) versus gate voltage VG (V) in the CV panels; C (pF) versus pulse number N in the LTP/LTD panels. Annotations: device scale bar 240 µm; VAK = –3.0 V and –4.0 V; write/erase pulses of +5.2 V/–5.0 V and +5.5 V/–5.5 V; pulse height +4.0 V to +6.1 V and –4.0 V to –6.1 V; pulse length 0.1 ms to 2.0 ms; Vpgr = 4.5, 5.0 and 5.2 V.]
Fig. 2 | Measurement setup and CV curves of single devices. a, Microscopy image of the measured single device and measurement setup. b, Measured
CV curves for a device without memory at different VAK values; VAK is applied antisymmetrically, the d.c. voltage of the gate was swept between −7 and
7 V, and the small a.c. voltage had an amplitude of 100 mV with a frequency of 1 kHz. c, CV curve shifting due to the injection of charges. The device had
a memory in this case. d–f, Analogue value writing with pulse number modulation (constant write height) (d), pulse height modulation (the voltage is
increased/decreased from ±4.0 to ±6.1 V) (e) and pulse length modulation (f). In d–f, the shielding layer was grounded, and readout was performed
between each pulse with an a.c. signal, as shown in c. g, Pulse number modulation for different write pulse heights.
The fabricated devices had a gate length ranging from 10 to 60 µm, and the gate width was enlarged by winding it around several highly p+- and n+-doped finger-shaped regions, thus forming several parallel pin junctions. The larger area leads to a readily detectable capacitance and the minimum capacitance of turned-off devices could also be precisely measured (capacitive dynamic range). Figure 2a shows a microscopic image of the fabricated device. Capacitance–voltage (CV) measurements were carried out by applying an a.c. signal with a direct current (d.c.) bias (sweep) to the gate: the resulting a.c. current of the readout electrode was measured either by lock-in amplification or by an oscilloscope and current pre-amplifier. Data from the resulting fundamental CV curves for different d.c. voltages (VAK) on the n+ and p+ regions are shown in Fig. 2b (note that a normal silicon dioxide dielectric layer was used here instead of a memory dielectric). The CV curves get broader or are nearly extinguished depending on whether the pin junction is used in the reverse or forward bias direction, respectively; this behaviour is further explained in Supplementary Section 1. Generally, a capacitive coupling window is observed, which is high for depletion (and therefore for transmission through the shielding layer) and low during inversion or accumulation. The curves are derivatives of a sigmoid curve, which plays an important role in modelling neurons in artificial neural networks. A direct measurement of the sigmoid curve and further uses are explained in Supplementary Section 1.
Replacing the normal silicon dioxide dielectric with a memory dielectric and with a CV sweep from −5 to 5 V, one can observe a shifting of the capacitive coupling window with a memory window of 2.7 V (Fig. 2c), while the pin junction was grounded. Due to the shifting direction, one can conclude that charge trapping is the memory principle (for purely ferroelectric switching, the curves would shift in the opposite direction). In contrast to resistive devices, capacitive devices can only be read out by a.c. voltage or current signals. For this reason, an alternating voltage (0.5 V) is applied to the gate for readout, together with a bias voltage (1.0 V) to adjust the readout window, as indicated by the shaded area in Fig. 2c (note that the pin junction is grounded during readout). In Supplementary Fig. 11a,b, the readout current of a written and erased cell is shown, and a capacitive dynamic range of ~1:1,478 was experimentally achieved.
To store analogue values, one can apply short pulses with the same amplitude (Fig. 2d,g), apply pulses with increasing height (Fig. 2e) or change the pulse length (Fig. 2f) applied to the gate. The resulting curves exhibit some similarities to those obtained from pure ferroelectric switching14, indicating the ferroelectric assistance in the memory storage process. The curve in Fig. 2d shows a typical nonlinear long-term potentiation (LTP) curve with an exponential dependence.
$$C_{\mathrm{LTP}} = C_{\min} + \Delta C\left(1 - \exp\left(\frac{-N_{\mathrm{pgr}}}{\beta_{\mathrm{pgr}}}\right)\right) \qquad (3)$$
The same applies for the long-term depression (LTD)
$$C_{\mathrm{LTD}} = C_{\max} - \Delta C\left(1 - \exp\left(\frac{-N_{\mathrm{er}}}{\beta_{\mathrm{er}}}\right)\right), \qquad (4)$$
where Npgr and Ner denote the number of programming or erase pulses, respectively; βpgr and βer are the stretching factors; and Cmin and Cmax denote the minimum and maximum capacitance, respectively. Here ΔC describes the maximum change in capacitance. Changing the write pulse height of the pulse number modulation leads to more flattened or steepened curves (Fig. 2g). Write/erase pulse height modulation (Fig. 2e) can lead to relatively symmetric and, in certain regions, linear behaviour with respect to the pulse height steps. This is highly beneficial for implementing neuromorphic algorithms17. Pulse length modulation shows similar behaviour to pulse number modulation (Fig. 2f). In Supplementary Fig. 11c, the measured readout current is illustrated for LTP and LTD for different pulse numbers of pulse height modulation (Fig. 2e) and reveals the pinch-off and increase.
Other memory parameters, like device-to-device variation, endurance and retention can be found in Supplementary Section 9.
Crossbar array and implementation of training algorithm
Crossbar devices, used to execute an image recognition algorithm, were fabricated and wire bonded onto a chip carrier. A printed circuit board (PCB) was designed and controlled by a data acquisition system. An image of the fabricated chip with the bonding pads, a zoomed-in microscopy image of the crossbar and a scanning electron microscopy image are shown in Fig. 3a. Each memory cell had a size of 50 × 50 µm2.
A schematic of the device cross section is shown in Fig. 3b. The BLs of the memory array were separated by refilled deep trenches. Details of the fabrication process can be found in Methods.
The matrix comprised 26 WLs and 6 BLs (Fig. 3c). A differential weight topology17 was used with the positive and negative value of each weight separated in two memory cells. The values of these two BLs were subtracted from each other.
$$W_{ij} = C_{ij}^{+} - C_{ij}^{-} \qquad (5)$$
The input values are separated by a sign with a 180° phase shift. For the desired 'four-quadrant multiplication' (input × weight), a global clock signal is used together with the switched capacitor approach (Fig. 3c). Further details are explained in Supplementary Section 11. The integration capacitance of the amplifier is charged up in each period of the input sine signal, and hence, the number of periods (Nper) encodes the value of the input signal. This effect also leads to an averaging of the noise level and improvement in the signal-to-noise ratio, as explained later. This theoretical concept of 'four-quadrant multiplication' was confirmed with the following measurement (Fig. 3d): the input number of periods (Nper) and the number of programming pulses (Npgr), which adjust the actual weight, were varied in positive and negative values, while the output voltage was read. Positive and negative Nper values were encoded by a 180° phase shift and positive/negative programming pulses (Npgr) only changed the positive/negative weights, while the counterpart was in an erased state. Supplementary Fig. 12a,b shows the cross sections of the 3D plot in Fig. 3d. The curves along the input period number behave in a highly linear manner, and this linearity was also confirmed for the accumulation operation (Supplementary Fig. 12c), demonstrating a highly linear MAC operation with the proposed switched capacitor approach.
The first 25 WLs enable a vectorized input feature map for images of 5 × 5 pixels; thus, one single fully connected layer is carried out. Dark pixels are represented by positive values and bright pixels, by negative values. The bias input is mapped to the 26th WL.
Regarding the implemented training algorithm, the Manhattan update8,30 rule was chosen, due to its simplified training procedure. In conventional backpropagation training, the weight update is calculated as follows:
$$\Delta W_{ij} = -\alpha\, \delta_i(n)\, X_j(n), \qquad (6)$$
where α describes the learning rate, δi(n) is the backpropagated error and Xj(n) is the current input for the nth input image, which is randomly chosen from the training set. The weights are updated after each sample (stochastic training). The backpropagated error for a one-layer perceptron can be calculated as follows:
$$\delta_i(n) = \left[f_i^{d}(n) - f_i(n)\right] \left.\frac{df}{dv}\right|_{v = v_i(n)}, \qquad (7)$$
where fdi(n) is the desired output value and fi(n) is the current output. Function fi is related to the voltage output vi(n) of the ith sense amplifier and the activation function of the neuron (in this case, tanh):
$$f_i(v_i) = \tanh\left(\kappa\, v_i(n)\right), \qquad (8)$$
where κ is the steepness factor. With the Manhattan update rule, the weight update from equation (6) is coarse-grained by taking only its sign:
$$\Delta W_{ij}^{M} = \operatorname{sgn} \Delta W_{ij} \qquad (9)$$
Therefore, all the weights are updated by the same amount based on their sign. Figure 4a illustrates the pulse scheme for implementing the algorithm. The term δi(n)Xj(n) in equation (6) becomes positive if error δi(n) and input Xj(n) are both positive or both negative, and it becomes negative if they have opposite signs. Hence, one can describe this by an XNOR combination. To update the weights, the error signal is applied to the SL, as shown in Fig. 4a. The corresponding input signals are applied to the WL. The differential signal at the crossing points follows the XNOR operation, while the specific signals (shown in Fig. 4a) ensure that the maximum disturbance level is not higher than 1/3 of the write/erase voltage and thus effectively prevent the overwriting of cells in the same column or row (the memory cell acts as the selector itself; see Supplementary Sections 7 and 8). As a 5 × 5 image recognition task, the letters M, P and I were chosen, and one pixel in each of the samples was flipped, which results in a total set of 78 samples. These pseudo-images were separated into a test and training set; the test images are indicated by a blue frame (Fig. 4b). The resulting misclassified images versus training epochs for the training and test images are shown in Fig. 4c. Evidently, the number rapidly decreases after one training epoch and stays almost zero throughout the training epochs. Figure 4d shows the obtained mean neuron activations for the three classifications over the training epochs. The slightly higher simulated average misclassification rate (Fig. 4c) is the consequence of single steep climbs of the misclassification rate after an arbitrary number of epochs with 100% accuracy in some runs. Misclassifications after epoch 1 are caused by the very similar expected value for individual presynaptic neurons for letters M and P. Measurements also confirm the more stable results for the classification of letter I, as shown in Fig. 4d. The results are in accordance with other studies7,8.
Thus, experimental results on micrometre-sized devices demonstrate the working principle. For demonstrating scalability to the nanometre regime and superior energy efficiency, detailed and extensive simulations were performed, which are explained in the upcoming sections.
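The sign logic of the Manhattan update can be checked with a short sketch. It evaluates equation (6) for all four sign combinations of error and input, takes the sign as in equation (9), and compares the result with an XNOR of the two signs. The function names and the default learning rate `alpha` are illustrative assumptions; only the update direction matters here, since the step size in hardware is set by the write pulse.

```python
def sgn(x):
    """Sign function returning -1, 0 or +1."""
    return (x > 0) - (x < 0)


def manhattan_update(delta, x, alpha=0.1):
    """Return (dW, dW_M): equation (6) and its sign, equation (9)."""
    dw = -alpha * delta * x
    return dw, sgn(dw)


def xnor_of_signs(delta, x):
    """True when delta and x have the same sign (both assumed non-zero)."""
    return (delta > 0) == (x > 0)


for delta in (-0.3, 0.7):
    for x in (-1.0, 1.0):
        dw, dw_m = manhattan_update(delta, x)
        # With the minus sign of equation (6), equal signs give dW_M = -1
        # and opposite signs give dW_M = +1.
        assert (dw_m == -1) == xnor_of_signs(delta, x)
```

Because only the sign of the product is needed, the update direction for a whole row or column can be decided from two one-bit signals, which is what the pulse scheme on the SL and WL exploits.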
[Figure 3 image, panels a–d. Labels: chip and crossbar micrographs (scale bars 500 µm and 50 µm); cross section with WL, BL, GND, trench, Hf0.5Zr0.5O2, TiN, aluminium, SU-8 resist, n+, n– and p+ regions; circuit with WL1–WL26 (WL26 as bias), positive and negative BLs, clock, inputs at φ = 0° and φ = 180°, read/write switch, outputs V1–V3; 3D plot of V3 (V) versus Nper and Npgr.]
Fig. 3 | Crossbar arrangement and fundamental measurements. a, Wire-bonded chip with microscopy and scanning electron microscopy images.
b, Device cross section. c, Neuromorphic system for accomplishing ‘four-quadrant multiplication’: positive and negative inputs are 180° phase shifted
with each other. The a.c. conditions are the same as in Fig. 2, and the number of periods encodes the amount of input. The clock signal is high for a
rising edge in the positive signal and the switches are in the left position during a high clock signal. The SL is connected to GND during readout.
d, Measured ‘four-quadrant multiplication’ for different input period numbers Nper and programming pulse numbers (pulse number modulation) Npgr.
For negative Nper, the input signal is 180° phase shifted, and for positive Npgr, a positive BL is programmed; a negative BL is kept in an erased state (vice
versa for negative Npgr).
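A minimal sketch of the four-quadrant multiplication follows. It combines the exponential programming curve of equation (3) with the differential weight of equation (5) and treats the signed period number as the input. The parameter values `C_MIN`, `DELTA_C` and `BETA` are hypothetical, and the output is in arbitrary units; the measured quantity in Fig. 3d is a voltage that depends on the readout circuit, which is not modelled here.

```python
import math

C_MIN = 1.0     # pF, hypothetical erased-state capacitance
DELTA_C = 20.0  # pF, hypothetical maximum capacitance change
BETA = 10.0     # hypothetical stretching factor


def c_ltp(n_pulses):
    """Equation (3): capacitance after n_pulses programming pulses."""
    return C_MIN + DELTA_C * (1.0 - math.exp(-n_pulses / BETA))


def weight(n_pgr):
    """Equation (5): W = C+ - C-.

    Positive n_pgr programs the positive cell, negative n_pgr the
    negative cell; the counterpart stays in the erased state.
    """
    c_plus = c_ltp(n_pgr) if n_pgr > 0 else C_MIN
    c_minus = c_ltp(-n_pgr) if n_pgr < 0 else C_MIN
    return c_plus - c_minus


def mac_output(n_per, n_pgr):
    """Idealized output (arbitrary units): signed period count times weight.

    A negative n_per stands for the 180-degree phase-shifted input.
    """
    return n_per * weight(n_pgr)
```

In this model the output is linear in the period number and follows the saturating programming curve along the pulse-number axis, and the sign of the result flips whenever either the input or the weight changes sign.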
TCAD simulations on single devices
A device with 90 nm gate length (Fig. 5a) was simulated using Synopsys TCAD. Figure 5b (where no memory dielectric was integrated for the first simulations) shows the CV curves of the coupling capacitances between the gate and readout electrode with respect to the applied gate voltage (VG), which are consistent with the observed experimental behaviour (Fig. 2b).
The ratio between the maximum capacitance and lower-state capacitance obtained by shifting the gate voltage by 3 V is 1:90 in this device, and this ratio can be further enlarged by using thinner gate oxides or larger gate lengths, as shown in Fig. 5c. In general, the capacitive ratio decreases with a smaller gate length because the influence of the space charge region becomes more pronounced for smaller gate lengths (short channel effect) and sufficient shielding is hard to achieve in this region (Fig. 5c, inset). By using high-κ dielectrics for the top and bottom oxides, a ratio of 1:60 was obtained for a 45 nm device with the same capacitance as the 90 nm device, as shown in Supplementary Section 2. A dynamic range of 1:60–1:90 is sufficient to achieve a precision of 6–8 bits31.
Including a memory window (~3 V for charge-trapping memories and ~1–2 V for ferroelectric memories depending on the thickness and coercive field) leads to shifted CV curves (Fig. 5d). The a.c. readout voltage is indicated in Fig. 5d; for the positive shifted curve, the resulting readout current and therefore the accumulated charge will be very large. The total readout charge over one-half period of the applied sinusoidal signal versus memory shift is shown in Fig. 5e. Most of the negative memory window is used for turning off the device.
Scalability to 45 nm
With regard to lateral scalability, it is necessary to distinguish three aspects: (1) the scalability of the memory technology in the top dielectric itself with regard to how many levels can be stored; (2) the sensitivity of the sense amplifier at the end of each BL for detecting the accumulated charge; (3) the noise level of one single device during
[Figure 4 image, panels a–d. a, Pulse levels on SL and WL (±1/3; ±2/3, 0) and resulting differential signals between –1 and +1. c, Nmis (0–40) versus Nepoch (0–10) for the training and test sets, measured and simulated (averaged). d, Neuron activations (–1.0 to 1.0) versus Nepoch (0–10).]
Fig. 4 | Manhattan update training on crossbar. a, Pulse scheme to enable XNOR operation during Manhattan weight update (the write/erase pulse
height was ±5.2 V and length was 1 ms). The disturb level is exactly 1/3 of the write/erase voltage. b, Training and test set of the letters M, P and I with one
flipped pixel. The test images are framed in purple. c, Number of misclassified images Nmis for the training and test sets over ten training epochs (Nepoch).
The measured curve is compared with the simulated curves. d, Average artificial neuron activation for three classifications (f1, f2 and f3) and three images
over ten training epochs.
readout. Fairly common resolutions for input, weight and output signals for neural networks are in the range of 4–8 bits (16–256 levels)31. This analogue-like resolution has a significant influence on scalability. Typically, lower precision is needed for inference tasks.
With respect to the memory material, one can generally conclude that charge-trapping memories (for example, SONOS) have shown up to 31 levels down to 40 nm (ref. 16). The disadvantage of this memory technology is the relatively high write energy and slowness during writing (millisecond regime). However, SONOS might be an alternative for inference-only applications. On the other hand, hafnium oxide (a ferroelectric) has very low write energies and is fast (nanosecond to microsecond regime). Research is still underway on the scalability of ferroelectric memories with regard to analogue storage. From FeFETs, it is known that they tend to show abrupt switching events below 500 nm, which is attributed to the limited grain size15.
Regarding capacitive measurement resolutions, some work was done in the context of DNA sensing and chip interconnect measurements with resolutions down to <10 aF (charge-based capacitive measurements, capacitance-to-frequency conversion and lock-in detection)32–36. These are similar to a conventional sense amplifier37,38 and contain an integration capacitor that is charged either by an operational amplifier circuit or a current mirror. Details on the sensitivity calculation can be found in Supplementary Section 3; generally, however, one has to consider that in neuromorphic devices, the accumulated charge from many memory cells (several hundreds to thousands) is read out at once and used for further information processing, which gives rise to much larger charges compared with only one cell. Furthermore, several pulse/period numbers are used for encoding the input value, which leads to stepwise charge integration over many periods. For the device shown in Fig. 5, Nper = 142 periods is necessary, which fits well into a range of 7–8 bits of the input signal (Supplementary Section 3). Note that 128 periods are sufficient for an 8-bit signed integer due to the use of the 180° phase shift for negative values of the switched capacitor approach.
Regarding the noise level of capacitive devices, one has to consider kTC noise.
$$v_n = \sqrt{\frac{k_B T}{C}} \qquad (10)$$
where kB defines the Boltzmann constant, T the temperature and C the capacitance. For a 6.65 aF device (Fig. 5d), one obtains a noise voltage of 25.00 mV (at room temperature), which is 14 times lower than the effective readout value of 0.35 V. However, one has to consider that the noise level decreases with the number of repetitive measurements, namely, as 1/√Nper, which results in a noise level of 2.10 mV (at room temperature) or 169 times lower than the effective readout value; this defines a precision of ~7 bits. Based on this minimum amplitude necessary to distinguish between different levels, it also becomes possible to assess the theoretical energy efficiency of resistive and capacitive devices in general (Supplementary Section 4): capacitive devices are at least eight times more energy efficient than resistive devices.
Simulation of ultrahigh energy efficiency
Much of the energy sourced to 'memcapacitors' can be recovered since it is stored in the capacitor; this is an important difference from resistors in which the readout operation is inherently
[Figure 5 image, panels a–f. a, Simulated structure with p+ and n+ contacts (dimension labels 7 nm, 20 nm and 270 nm). b, C (aF) for VAK = –1 V, 0 V and 1 V. c, Cmax/Cmin for EOT = 5, 7, 10, 20 and 50 nm, with electron-density inset. d, C (aF) versus VG (V), memory window 3 V. e, Qacc (aC) versus Vshift (V). f, C (pF) versus VG (V), experiment and simulation.]
Fig. 5 | TCAD simulation results. a, Simulated structure with gate length Lg = 90 nm. b, Obtained CV curves with respect to the gate voltage for different
voltages VAK along the p+n–n+ diode (quasi-static simulation). The voltage VAK was applied antisymmetrically, as that in Fig. 2. c, Capacitive dynamic ratio
(maximum capacitance/minimum capacitance of the CV curves with p+n–n+ connected to GND) for different gate lengths and gate oxide thicknesses. The
inset shows the electron density, and the short channel effect becomes obvious. EOT, equivalent oxide thickness. d, Shifting of the CV curves for VAK = 0 V
for different memory charges in the gate oxide. Note the applied readout a.c. signal with bias. e, Accumulated charge (Qacc) for different voltage shifts (Vshift;
caused by memory charges) over one-half period of the a.c. signal in d. f, Comparison of the simulated and experimental capacitive coupling curves for the
micrometre-scaled device shown in Fig. 2.
dissipative due to Joule heating. The energy fed in during charging can be, in principle, recovered during discharging. This concept of energy recovery is also present in adiabatic circuit designs39,40, which are at the core of the reversible computing paradigm41,42. The limiting factors of energy recovery in adiabatic circuits are resistive losses in the circuit, as well as in the inductances used for the power clock generators. The inductances have limited quality factors (q factor) in the order of dozens to hundreds. In common adiabatic realizations, energy recovery of the supply clock generators is of the order of 95% for harmonic signals43–45, which means the supplied active power is q = 20 times lower than the reactive power.
To estimate the time delay, areal efficiency and energy efficiency (Table 1) of a realistic crossbar arrangement (including parasitic elements), a SPICE model (Supplementary Fig. 4a) for the 90 nm device was developed (Supplementary Section 5). One can conclude that extremely fast readout transitions can suppress shielding in the SL, since charge cannot be supplied any longer (silicide lines are a critical resistive path). In the table, the energetically worst-case scenario was assumed: all the WLs are activated at once and all the weights are zero with a resulting shielding effect, which, in turn, would lead to charging in the top gate oxide. Table 1 summarizes the minimum period of time for different matrix sizes, which is proportional to the RC delay, with R being the resistance and C the capacitance. The areal efficiency Aη in TOPS mm–2 can be derived from the memory footprint (2 × 8 F2), assuming differential weights and the earlier mentioned time delay. The active (Wp) and reactive (Wr) energy per cell for 142 periods is also summarized in Table 1. With this estimate in mind, we can conclude a minimum energy efficiency ηrec of 3,452.6 TOPS W–1 in the worst-case scenario for 0% input signal sparsity and 100% weight sparsity and an energy recovery of 95% (Supplementary Section 5). Without any charge recovery, the energy efficiency η would amount to 198.5 TOPS W–1. In a realistic neural network scenario, for example, a one-layer perceptron trained on the Modified National Institute of Standards and Technology (MNIST) database, the energy efficiency is 29,600 TOPS W–1 including charge recovery (Supplementary Section 6). Without recovery, the efficiency amounts to 1,702 TOPS W–1 for MNIST.
Comparison of simulation and experimental results
To verify the functionality of the simulator, we performed simulations of the device with 60 µm gate length (Fig. 2). As shown in Fig. 5f, the experimental data from Fig. 2d match well with the simulated data.
As shown in Supplementary Fig. 14, we measured the gate charging current together with the applied readout a.c. voltage for the single device (Fig. 2), and a perfect 90° phase shift is visible. From the curves, we can calculate the reactive energy (Wr) consumed per period (using equations 31–33, Supplementary Section 5) and obtain Wr = 3.22 nJ per period. Furthermore, for 142 periods, as in the simulation, we obtain the total reactive energy for one MAC operation, namely, Wr,tot = 457 nJ per cell. If we scale this value down by seven orders of magnitude, we obtain Wr,scaled = 45.7 fJ per cell (the capacitance shown in Fig. 2d is seven orders of magnitude higher than the capacitance of the simulated 90 nm device shown in Fig. 5b).
This value is approximately ten times higher than the value shown in Table 1 (5 fJ per cell). One has to consider that the buried oxide of the experimental devices is much thicker (190 nm) than in the case of the 90 nm device simulation (15 nm), leading to a 12.7 times lower readout capacitance/area at approximately the same gate oxide capacitance/area. Also considering the different device silicon thicknesses, one can obtain a corrected reactive energy of Wr,scaled,corr = 5.84 fJ per cell, which is very close to the value shown in Table 1. Other influencing phenomena during scaling, like short channel effects (Fig. 5c), quantum confinement and band-to-band tunnelling, are explained in Supplementary Section 10.
Table 1 | Results on areal and energy efficiency obtained from SPICE simulation
Array size       Period Tper (total time delay, 142 periods)   Aη (TOPS mm–2; 142 periods, 16 F2)   Wr (Wp) per cell (fJ)*   ηrec (η) (TOPS W–1)*
100 × 100        1.00 ns (142.00 ns)                           108.70                               5.000 (0.015)            3,782.20 (199.51)
500 × 500        15.00 ns (2.13 µs)                            7.25                                 5.000 (0.022)            3,676.80 (199.19)
1,000 × 1,000    30.00 ns (4.25 µs)                            3.62                                 5.000 (0.040)            3,452.60 (198.54)
2,500 × 2,500    200.00 ns (28.40 µs)                          0.54                                 5.000 (0.039)            3,461.70 (198.59)
*All cells are erased (worst-case scenario), 95% energy efficiency of power clock source.
Necessary time period Tper and resulting areal efficiency Aη for different matrix sizes. The reactive energy during the readout of arrays, Wr, and active energy, Wp, are obtained from simulations (Supplementary Section 5). The energy is presented per cell and for 142 periods. From this number and assuming a 95% energy recovery of the power source, energy efficiency ηrec (in TOPS W–1) can be calculated for the energetically worst-case scenario (erased state). The same applies for energy efficiency η without recovery.
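The efficiency columns of Table 1 can be approximately reproduced from the other columns with a short sketch. It rests on assumptions that are not stated in the table itself: the feature size F is taken as the simulated 90 nm gate length, one MAC on a differential pair (16 F2) counts as two operations, and each cell contributes one operation for the energy figure. Function and constant names are illustrative. The results agree with the tabulated values to within about 1%, which suggests the table was computed with unrounded energies.

```python
F_NM = 90.0        # feature size in nm (assumption: simulated gate length)
FOOTPRINT_F2 = 16  # two cells of 8 F^2 per differential weight
OPS_PER_MAC = 2    # one multiply and one accumulate (assumption)


def areal_efficiency_tops_mm2(total_delay_ns):
    """Operations per second per mm^2, in TOPS mm^-2."""
    area_mm2 = FOOTPRINT_F2 * (F_NM * 1e-6) ** 2  # 1 nm = 1e-6 mm
    ops_per_s = OPS_PER_MAC / (total_delay_ns * 1e-9)
    return ops_per_s / area_mm2 / 1e12


def energy_efficiency_tops_w(w_r_fj, w_p_fj, recovery):
    """Operations per joule, in TOPS W^-1, for one operation per cell.

    Only the unrecovered share of the reactive energy is dissipated,
    plus the full active energy.
    """
    e_cell_j = ((1.0 - recovery) * w_r_fj + w_p_fj) * 1e-15
    return 1.0 / e_cell_j / 1e12


# 1,000 x 1,000 array from Table 1: Wr = 5.000 fJ, Wp = 0.040 fJ.
eta_rec = energy_efficiency_tops_w(5.0, 0.040, recovery=0.95)
eta = energy_efficiency_tops_w(5.0, 0.040, recovery=0.0)
a_eta = areal_efficiency_tops_mm2(142.0)  # 100 x 100 array
```

With 95% recovery the dissipated energy per cell is dominated by the 5% of the reactive energy that is not recovered, so the efficiency gain from recovery is roughly a factor of 17 rather than 20.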