In-Memory Hyperdimensional Computing
https://doi.org/10.1038/s41928-020-0410-3
Hyperdimensional computing is an emerging computational framework that takes inspiration from attributes of neuronal circuits including hyperdimensionality, fully distributed holographic representation and (pseudo)randomness. When employed for machine learning tasks, such as learning and classification, the framework involves manipulation and comparison of large patterns within memory. A key attribute of hyperdimensional computing is its robustness to the imperfections associated with the computational substrates on which it is implemented. It is therefore particularly amenable to emerging non-von Neumann approaches such as in-memory computing, where the physical attributes of nanoscale memristive devices are exploited to perform computation. Here, we report a complete in-memory hyperdimensional computing system in which all operations are implemented on two memristive crossbar engines together with peripheral digital complementary metal–oxide–semiconductor (CMOS) circuits. Our approach can achieve a near-optimum trade-off between design complexity and classification accuracy based on three prototypical hyperdimensional computing-related learning tasks: language classification, news classification and hand gesture recognition from electromyography signals. Experiments using 760,000 phase-change memory devices performing analog in-memory computing achieve comparable accuracies to software implementations.
1IBM Research – Zurich, Rüschlikon, Switzerland. 2Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland. ✉e-mail: [email protected]; [email protected]

Biological computing systems trade accuracy for efficiency. Thus, one solution to reduce energy consumption in artificial systems is to adopt computational approaches that are inherently robust to uncertainty. Hyperdimensional computing (HDC) is one such framework and is based on the observation that key aspects of human memory, perception and cognition can be explained by the mathematical properties of hyperdimensional spaces comprising high-dimensional binary vectors known as hypervectors. Hypervectors are defined as d-dimensional (where d ≥ 1,000) (pseudo)random vectors with independent and identically distributed (i.i.d.) components1. When the dimensionality is in the thousands, a large number of quasi-orthogonal hypervectors exist. This allows HDC to combine such hypervectors into new hypervectors using well-defined vector space operations, defined such that the resulting hypervector is unique, and with the same dimension. A powerful system of computing can be built on the rich algebra of hypervectors2. Groups, rings and fields over hypervectors become the underlying computing structures with permutations, mappings and inverses as primitive computing operations.

In recent years, HDC has been employed in a range of applications, including machine learning, cognitive computing, robotics and traditional computing. It has shown significant promise in machine learning applications that involve temporal patterns, such as text classification3, biomedical signal processing4,5, multimodal sensor fusion6 and distributed sensors7,8. A key advantage is that the training algorithm in HDC works in one or a few shots: that is, object categories are learned from one or a few examples, and in a single pass over the training data as opposed to many iterations. In the highlighted machine learning applications, HDC has achieved similar or higher accuracy with fewer training examples compared to support vector machines (SVMs)4, extreme gradient boosting9 and convolutional neural networks (CNNs)10, and lower execution energy on embedded CPUs/GPUs compared to SVMs11, CNNs and long short-term memory5. Applications of HDC in cognitive computing include solving Raven's progressive matrices12, functional imitation of concept learning in honey bees13 and analogical reasoning14. In the field of robotics, HDC has been employed for learning sensorimotor control for active perception in robots10. In traditional forms of computing, HDC has been proposed for efficient representation of structured information15 as well as the synthesis and execution of finite state automata16 and variants of recurrent neural networks17.

HDC begins by representing symbols with i.i.d. hypervectors that are combined by nearly i.i.d.-preserving operations, namely binding, bundling and permutation, and then stored in associative memories (AMs) to be recalled, matched, decomposed or reasoned about. This chain implies that failure in a component of a hypervector is not contagious and forms a computational framework that is intrinsically robust to defects, variations and noise18. The manipulation of large patterns stored in memory and its inherent robustness make HDC particularly well suited to emerging computing paradigms such as in-memory computing or computational memory based on emerging nanoscale resistive memory or memristive devices19–23. In one such work, a 3D vertical resistive random access memory (ReRAM) device was used to perform individual operations for HDC24,25. In another work, a carbon-nanotube field-effect transistor-based logic layer was integrated with ReRAMs, improving efficiency further26. However, these architectures offered only limited applications such as a single language recognition task24,26 or a restricted binary classification version of the same task26, and their evaluation is based on simulations and compact models derived from small prototypes with only 256 ReRAM cells24 or a small 32 bit datapath for hypervector manipulations that results in three orders of magnitude higher latency overhead26.

In this Article, we report a complete integrated in-memory HDC system in which all the operations of HDC are implemented on two planar memristive crossbar engines together with peripheral digital CMOS circuits. We devise a way of performing hypervector binding entirely within a first memristive crossbar using an in-memory read logic operation and hypervector bundling near the crossbar with CMOS logic. These key operations of HDC cooperatively encode hypervectors with high precision, while eliminating the need to repeatedly program (write) the memristive devices. In contrast, previous work on HDC using memristive devices did not employ in-memory logic operations for binding; instead, a ReRAM-based XOR lookup table24 or digital logic26 was used. Moreover, the previous in-memory compute primitives for permutation24 and bundling26 resulted in repeated programming of the memristive devices, which is prohibitive given the limited cycling endurance.

In our architecture, an AM search is performed using a second memristive crossbar for in-memory dot-product operations on the encoded output hypervectors from the first crossbar, realizing the full HDC system functionality. Our combination of analog in-memory computing with CMOS logic allows continual functioning of the memristive crossbars with desired accuracy for a wide range of multiclass classification tasks. We verify the integrated inference functionality of the system through large-scale mixed hardware/software experiments, in which up to 49 d = 10,000-dimensional hypervectors are encoded in 760,000 hardware phase-change memory (PCM) devices performing analog in-memory computing. Our experiments achieve comparable accuracies to the software baselines and surpass those reported in previous work on an emulated small ReRAM crossbar24. Furthermore, a complete system-level design of the in-memory HDC architecture synthesized using 65 nm CMOS technology demonstrates >6× end-to-end reductions in energy compared with a dedicated digital CMOS implementation. With our approach, we map all operations of HDC either in-memory or near-memory and demonstrate their integrated functionality for three specific machine learning related tasks.

The concept of in-memory HDC
When HDC is used for learning and classification, a set of i.i.d., hence quasi-orthogonal hypervectors, referred to as basis hypervectors, are first selected to represent each symbol associated with a dataset. For example, if the task is to classify an unknown text into the corresponding language, the symbols could be the letters of the alphabet. The basis hypervectors stay fixed throughout the computation. Assuming that there are h symbols, $\{s_i\}_{1}^{h}$, the set of the h, d-dimensional basis hypervectors $\{B_i\}_{1}^{h}$ is referred to as the item memory (IM) (Fig. 1). Basis hypervectors serve as the basis from which further representations are made by applying a well-defined set of component-wise operations: addition of binary hypervectors [+] is defined as the component-wise majority, multiplication (⊕) is defined as the component-wise exclusive-OR (XOR) (or equivalently as the component-wise exclusive-NOR (XNOR)) and permutation (ρ) is defined as a pseudo-random shuffling of the coordinates. Applied on dense binary hypervectors where each component has equal probability of being zero or one27, all these operations produce a d-bit hypervector resulting in a closed system.

Subsequently, during the learning phase, the basis hypervectors in the IM are combined with the component-wise operations inside an encoder to compute, for example, a quasi-orthogonal n-gram hypervector representing an object of interest28, and to add n-gram hypervectors from the same category of objects to produce a prototype hypervector representing the entire class of category. In the language example, the encoder would receive input text associated with a known language and would generate a prototype hypervector corresponding to that language. In this case, n determines the smallest number of symbols (letters in the example) that are combined while performing an n-gram encoding operation. When the encoder receives n consecutive symbols, {s[1], s[2], …, s[n]}, it produces an n-gram hypervector through a binding operation given by

$$G(s[1], s[2], \ldots, s[n]) = B[1] \odot \rho(B[2]) \odot \cdots \odot \rho^{\,n-1}(B[n]) \qquad (1)$$

where B[k] corresponds to the associated basis hypervector for symbol s[k]. The operator $\odot$ denotes the XNOR and ρ denotes a pseudo-random permutation operation, for example, a circular shift by 1 bit. The encoder then bundles several such n-gram hypervectors from the training data using component-wise addition followed by a binarization (majority function) to produce a prototype hypervector for the given class. The overall encoding operation results in c, d-dimensional prototype hypervectors (referred to as associative memory (AM)) assuming there are c classes.

When inference or classification is performed, a query hypervector (for example, from a text of unknown language) is generated identically to the way the prototype hypervectors are generated. Subsequently, the query hypervector is compared with the prototype hypervectors inside the AM to make the appropriate classification. Equation (2) defines how a query hypervector Q is compared against each of the prototype hypervectors Pi out of c classes to find the predicted class with maximum similarity. This AM search operation can, for example, be performed by calculating the inverse Hamming distance:

$$\mathrm{Class}_{\mathrm{Pred}} = \underset{i \in \{1,\ldots,c\}}{\arg\max} \sum_{j=1}^{d} Q(j) \odot P_i(j) \qquad (2)$$

One key observation is that the two main operations presented above, namely the encoding and AM search, are about manipulating and comparing large patterns within the memory. Both IM and AM (after learning) represent permanent hypervectors stored in the memory. As a lookup operation, different input symbols activate the corresponding stored patterns in the IM that are then combined inside or around memory with simple local operations to produce another pattern for comparison in AM. These component-wise arithmetic operations on patterns allow a high degree of parallelism as each hypervector component needs to communicate with only a local component or its immediate neighbours. This highly memory-centric aspect of HDC is the key motivation for the in-memory computing implementation proposed in this work.

The essential idea of in-memory HDC is to store the components of both the IM and the AM as the conductance values of nanoscale memristive devices29,30 organized in crossbar arrays and enable HDC operations in or near to those devices (Fig. 1). The IM of h rows and d columns is stored in the first crossbar, where each basis hypervector is stored on a single row. To perform operations between the basis hypervectors for the n-gram encoding, an in-memory read logic primitive is employed. Unlike the majority of reported in-memory logic operations31–33, the proposed in-memory read logic is non-stateful and this obviates the need for high write endurance of the memristive devices. Additional peripheral circuitry is used to implement the remaining permutations and component-wise additions needed in the encoder. The AM of c rows and d columns is implemented in the second crossbar, where each prototype hypervector is stored on a single row. During supervised learning, each prototype hypervector output from the first crossbar is programmed into a certain row of the AM based on the provided label. During inference, the query hypervector output from the first crossbar is input as voltages on the wordline driver, to perform the AM search using an in-memory dot product primitive. Because every memristive device in the AM and IM is reprogrammable, the representation of hypervectors is not hardcoded, unlike refs. 24–26, which used device variability for projection.

This design ideally fits the memory-centric architecture of HDC, because it allows us to perform the main computations on the IM and AM within the memory units with a high degree of parallelism. Furthermore, the IM and AM are only programmed once while training on a specific dataset, and the two types of in-memory computation that are employed involve just read operations. Therefore, non-volatile memristive devices are very well suited for implementing the IM and AM, and only binary conductance states are required.
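The following is a minimal NumPy sketch of the pipeline described above: an item memory of pseudo-random binary hypervectors, n-gram encoding by XNOR binding of circularly shifted basis hypervectors (equation (1)), bundling by component-wise majority, and AM search by inverse Hamming distance (equation (2)). It is illustrative only: the toy symbol sequences, the random seed and the tie-breaking of the majority function are choices made here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 10_000, 27, 4        # dimensionality, number of symbols, n-gram size

# Item memory (IM): h pseudo-random, i.i.d. binary basis hypervectors.
IM = rng.integers(0, 2, size=(h, d), dtype=np.uint8)

def rho(x, k):
    """Pseudo-random permutation; here a circular shift by k positions."""
    return np.roll(x, k)

def ngram(symbols):
    """Equation (1): XNOR-bind the permuted basis hypervectors of n symbols."""
    g = IM[symbols[0]]
    for k in range(1, len(symbols)):
        g = 1 - (g ^ rho(IM[symbols[k]], k))       # component-wise XNOR
    return g

def encode(sequence):
    """Bundle all n-grams of a symbol sequence by component-wise majority."""
    m = len(sequence) - n + 1
    acc = np.zeros(d, dtype=np.int64)
    for i in range(m):
        acc += ngram(sequence[i:i + n])
    return (2 * acc > m).astype(np.uint8)          # binarize (majority)

def am_search(query, AM):
    """Equation (2): class whose prototype agrees with the query in the most
    components (inverse Hamming distance)."""
    return int(np.argmax((AM == query).sum(axis=1)))

# Toy single-pass training: one prototype hypervector per 'class'.
train = {0: rng.integers(0, h, 500), 1: rng.integers(0, h, 500)}
AM = np.stack([encode(seq) for seq in train.values()])
print(am_search(encode(train[0][:100]), AM))       # should typically print 0
```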
Fig. 1 | The concept of in-memory HDC. A schematic of the concept of in-memory HDC showing the essential steps associated with HDC (left) and how they are realized using in-memory computing (right). An item memory (IM) stores h, d-dimensional basis hypervectors that correspond to the symbols associated with a classification problem. During learning, based on a labelled training dataset, an encoder performs dimensionality-preserving mathematical manipulations on the basis hypervectors to produce c, d-dimensional prototype hypervectors that are stored in an AM. During classification, the same encoder generates a query hypervector based on a test example. Subsequently, an AM search is performed between the query hypervector and the hypervectors stored in the AM to determine the class to which the test example belongs. In in-memory HDC, both the IM and AM are mapped onto crossbar arrays of memristive devices. The mathematical operations associated with encoding and AM search are performed in place by exploiting in-memory read logic and dot-product operations, respectively. A dimensionality of d = 10,000 is used. SA, sense amplifier; AD converters, analog-to-digital converters.
In this work, we have used PCM technology34,35, which operates by switching a phase-change material between amorphous (high resistivity) and crystalline (low resistivity) phases to implement binary data storage (see Methods). PCM has also been successfully employed in novel computing paradigms such as neuromorphic computing36–40 and computational memory20,22,41,42, which makes it a good candidate for realizing the in-memory HDC system.

In the remaining part of this Article, we will elaborate the detailed designs of the AM, the encoder and finally propose a complete in-memory HDC system that achieves a near-optimum trade-off between design complexity and output accuracy. The functionality of the in-memory HDC system will be validated through experiments using a prototype PCM chip fabricated in 90 nm CMOS technology (see Methods), and a complete system-level design implemented using 65 nm CMOS technology will be presented.

The AM search module
Classification involves an AM search between the prototype hypervectors and the query hypervector using a suitable similarity metric, such as the inverse Hamming distance (invHamm) computed from equation (2). Using associativity of addition operations, the expression in equation (2) can be decomposed into the addition of two dot-product terms as shown in equation (3):

$$\mathrm{Class}_{\mathrm{Pred}} = \underset{i \in \{1,\ldots,c\}}{\arg\max}\ \left( Q \cdot P_i + \overline{Q} \cdot \overline{P}_i \right) \simeq \underset{i \in \{1,\ldots,c\}}{\arg\max}\ Q \cdot P_i \qquad (3)$$

where $\overline{Q}$ denotes the logical complement of Q. Because the operations associated with HDC ensure that both the query and prototype hypervectors have an almost equal number of zeros and ones, the dot product (dotp) $\arg\max_{i \in \{1,\ldots,c\}} Q \cdot P_i$ can also serve as a viable similarity metric.

To compute the invHamm similarity metric, two memristive crossbar arrays of c rows and d columns are required, as shown in Fig. 2a. The prototype hypervectors, Pi, are programmed into one of the crossbar arrays as conductance states. Binary '1' components are programmed as crystalline states and binary '0' components are programmed as amorphous states. The complementary hypervectors $\overline{P}_i$ are programmed in a similar manner into the second crossbar array. The query hypervector Q and its complement $\overline{Q}$ are applied as voltage values along the wordlines of the respective crossbars. In accordance with Kirchhoff's current law, the total current on the ith bitline will be equal to the dot product between the query hypervector and the ith prototype hypervector. The results of these in-memory dot-product operations from the two arrays are added in a pairwise manner using digital adder circuitry in the periphery and are subsequently input to a winner-take-all (WTA) circuit that outputs a '1' only on the bitline corresponding to the class of maximum similarity value. When the dotp similarity metric is considered, only the crossbar encoding Pi is used and the array of adders in the periphery is eliminated, resulting in reduced hardware complexity.

Experiments were performed using a prototype PCM chip to evaluate the effectiveness of the proposed implementation on three common HDC benchmarks: language classification, news classification and hand gesture recognition from electromyography (EMG) signals (see Methods). These tasks demand a generic programmable architecture to support different numbers of inputs, classes and data types (see Methods). In the experiments, the prototype hypervectors (and their complements) are learned beforehand in software and are then programmed into the PCM devices on the chip. Inference is then performed with a software encoder and using equation (3) for the AM search, in which all multiplication operations are performed in the analog domain (by exploiting Ohm's law) on chip and the remaining operations are implemented in software (see Methods and Supplementary Note 1). The software encoder was employed to precisely assess the performance and accuracy of the AM search alone when implemented in hardware. The in-memory encoding scheme and its experimental validation are presented in sections 'The n-gram encoding module' and 'The complete in-memory HDC system'.

Although HDC is remarkably robust to random variability and device failures, deterministic spatial variations in the conductance values could pose a challenge. Unfortunately, in our prototype PCM chip, the conductance values associated with the crystalline state do exhibit a deterministic spatial variation (Supplementary Note 2). However, given the holographic nature of the hypervectors, this can be easily addressed by a random partitioning approach. We employed a coarse-grained randomization strategy, where the

Fig. 2 | AM search. a, Schematic of the AM search architecture to compute the invHamm similarity metric. Two PCM crossbar arrays of c rows and d columns are employed. b, Schematic of the coarse-grained randomization strategy employed to counter the variations associated with the crystalline PCM state. c, Results of the classification task show that the experimental on-chip accuracy results compare favourably with the 10-partition simulation results and software baseline for both similarity metrics on the three datasets.
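As a functional, noise-free illustration of the AM search described above, the sketch below emulates the two crossbars with binary matrices: the prototype hypervectors Pi (a '1' standing for a crystalline, high-conductance device) in one array and their complements in the other, with each matrix-vector product standing in for applying the query as wordline voltages and summing bitline currents. The number of classes, the random seed and the noisy query are invented for illustration; the real arrays operate in the analog domain.

```python
import numpy as np

rng = np.random.default_rng(1)
d, c = 10_000, 8                          # dimensionality and number of classes

P     = rng.integers(0, 2, size=(c, d))   # first crossbar: prototype hypervectors
P_bar = 1 - P                             # second crossbar: their complements

def am_search(Q, metric="invHamm"):
    """Emulated in-memory AM search (equation (3))."""
    dotp = P @ Q                          # bitline currents of the first crossbar
    if metric == "dotp":
        return int(np.argmax(dotp))       # single-crossbar dotp metric
    dotp_c = P_bar @ (1 - Q)              # bitline currents of the second crossbar
    return int(np.argmax(dotp + dotp_c))  # pairwise addition + WTA -> invHamm

# Query: a copy of prototype 3 with 5% of its components flipped.
Q = P[3].copy()
flip = rng.choice(d, size=d // 20, replace=False)
Q[flip] ^= 1

print(am_search(Q, "invHamm"), am_search(Q, "dotp"))   # both should print 3
```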
Fig. 3 | In-memory n-gram encoding based on 2-minterm. a, The basis hypervectors and their complements are mapped onto two crossbar arrays. Through a sequence of in-memory logical operations, the approximated n-gram G, as in equation (5), is generated. b, Classification results on the language (using n = 4) and news (using n = 5) datasets, showing the performance of the 2-minterm approximation compared with the all-minterm approach.
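Equation (5) is not reproduced in this excerpt, so the sketch below only assumes one plausible form of the 2-minterm approximation suggested by Fig. 3a: of the minterms in the expansion of the n-gram XNOR, only two are kept, the AND across the (permuted) rows read from the original IM crossbar and the AND across the corresponding rows of the complementary IM crossbar, OR-ed together. Treat it as an assumption-labelled illustration rather than the paper's exact encoder.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 10_000, 27, 4

IM     = rng.integers(0, 2, size=(h, d), dtype=np.uint8)   # original IM crossbar
IM_bar = 1 - IM                                             # complementary IM crossbar

def rho(x, k):
    return np.roll(x, k)                  # circular-shift permutation

def ngram_all_minterm(symbols):
    """Exact n-gram: chained component-wise XNOR (equation (1))."""
    g = IM[symbols[0]]
    for k in range(1, len(symbols)):
        g = 1 - (g ^ rho(IM[symbols[k]], k))
    return g

def ngram_2minterm(symbols):
    """Assumed 2-minterm approximation: AND over the original rows OR-ed with
    the AND over the complementary rows (two read-logic passes)."""
    t1 = np.ones(d, dtype=np.uint8)
    t2 = np.ones(d, dtype=np.uint8)
    for k, s in enumerate(symbols):
        t1 &= rho(IM[s], k)
        t2 &= rho(IM_bar[s], k)
    return t1 | t2

sym = rng.integers(0, h, n)
exact, approx = ngram_all_minterm(sym), ngram_2minterm(sym)
# The approximation is much sparser; for even n, every component it sets is
# also set in the exact XNOR n-gram.
print(exact.mean(), approx.mean(), np.all(exact[approx == 1] == 1))
```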
partition that is selected, the relevant gates are activated through a controller sitting inside the AM search module. Finally, the results in the sum buffer are sent through WTA circuitry to find the maximum index that provides the prediction.

To experimentally validate the functionality of the complete in-memory HDC architecture, we chose to implement the inference operation, which comprises both encoding (to generate the query hypervectors) and AM search (Supplementary Video 1). For faster experiments, we trained our HDC model in software using the 2-minterm approximate encoding method described in the section 'The n-gram encoding module', which could be performed as well with our proposed in-memory HDC architecture. This software generates the hypervectors for AM from a given dataset. Subsequently, the components of all hypervectors of both IM and AM were programmed on individual hardware PCM devices, and the inference operation was implemented leveraging the two in-memory computing primitives (for both 2-minterm encoding and the AM search) using the prototype PCM chip (see Methods and Supplementary Note 1). Figure 4b summarizes the accuracy results with software, the PCM statistical model and the on-chip experiment for the language and news classification benchmarks. Compared with the previous experiment, where only AM was contained on-chip, the full chip experiment results show a similar accuracy level, indicating the minimal effect on accuracy when porting the IM into PCM devices with in-memory n-gram encoding. Furthermore, the accuracy level reported in this experiment is close to the accuracy reported with the software for the same parametric configuration of the HD inference model.

Fig. 4 | The complete in-memory HDC system. a, Schematic of the architecture, showing the 2-minterm encoder and AM search engine employing the dotp metric. b, The classification accuracy results on the news and language datasets, where both the encoding and AM search are performed in software, simulated using the PCM model and are experimentally realized on the chip.

Finally, to benchmark the performance of the system in terms of energy consumption, the digital submodules in the system-level architecture (marked with dotted boundaries in Fig. 4a) that fall outside the PCM crossbar arrays were synthesized using 65 nm CMOS technology. The synthesis results for these modules were combined with the performance characteristics of PCM crossbar arrays to evaluate the energy, area and throughput of the full system (see Methods). Furthermore, PCM crossbar sections were implemented in CMOS distributed standard cell registers with associated multiplier–adder tree logic and binding logic for AM and IM, respectively, to construct a complete CMOS HD processor to compare with the proposed PCM crossbar-based architecture.

A comparison of the performance between the all-CMOS approach and the PCM crossbar-based approach is presented in Table 1. A 6.01× improvement in total energy efficiency and 3.74× reduction in area is obtained with the introduction of the PCM crossbar modules. The encoder's energy expense for processing a query reduces by a factor of 3.50 with the PCM crossbar implementation, whereas that of the AM search module reduces by a factor of 117.5. However, these efficiency factors are partially masked by the CMOS peripheral circuitry that is common to both implementations, specifically that in the encoder module, which accounts for the majority of its energy consumption. When peripheral circuits are ignored and only the parts of the design that are exclusive to each approach are directly compared to each other, 14.4× and 334× energy savings and 24.5× and 31.9× area savings are obtained for the encoder and AM search module, respectively. It remains part of future work to investigate methods in which peripheral modules are designed more energy efficiently so that the overall system efficiency can be improved further.

Conclusions
HDC is a brain-inspired computational framework that is particularly well suited for the emerging computational paradigm of in-memory computing. We have reported a complete in-memory HDC system whose two main components are an encoder and an AM search engine. The main computations are performed in-memory with logical and dot-product operations on memristive devices. Due to the inherent robustness of HDC to errors, it was possible to approximate the mathematical operations associated with HDC to make it suitable for hardware implementation, and to use analog in-memory computing without significantly degrading the output accuracy. Our architecture is programmable to support different hypervector representations, dimensionality and number of input symbols and output classes to accommodate a variety of applications.
Table 1 | Performance comparison between a dedicated all-CMOS implementation and in-memory HDC with PCM crossbars

                                                 All-CMOS                      PCM crossbar based
                                                 Encoder   AM search  Total    Encoder   AM search  Total
Energy
  Average energy per query (nJ)                  1,474     1,110      2,584    420.8     9.44       430.3
  Improvement (PCM vs all-CMOS)                                                3.50×     117.5×     6.01×
  Exclusive modules avg. energy per query (nJ)   1,132     1,104      2,236    78.60     3.30       81.90
  Improvement (PCM vs all-CMOS)                                                14.40×    334.6×     27.30×
Area
  Total area (mm²)                               4.77      2.99       7.76     1.39      0.68       2.07
  Improvement (PCM vs all-CMOS)                                                3.43×     4.38×      3.74×
  Exclusive modules area (mm²)                   3.53      2.38       5.91     0.14      0.075      0.22
  Improvement (PCM vs all-CMOS)                                                24.57×    31.94×     27.09×
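The improvement factors in Table 1 are simply the ratios between the corresponding all-CMOS and PCM-crossbar values; a quick sanity check of the energy rows (numbers copied from the table):

```python
# Average energy per query (nJ) from Table 1.
cmos = {"encoder": 1474.0, "am_search": 1110.0, "total": 2584.0}
pcm  = {"encoder": 420.8,  "am_search": 9.44,   "total": 430.3}

for module in cmos:
    print(f"{module}: {cmos[module] / pcm[module]:.2f}x improvement")
# encoder: 3.50x, am_search: 117.58x (reported as 117.5x), total: 6.01x
```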
Hardware/software experiments using a prototype PCM chip delivered accuracies comparable to software baselines on language and news classification benchmarks with 10,000-dimensional hypervectors. These experiments used hardware PCM devices to implement both in-memory encoding and AM search, thus demonstrating the hardware functionality of all the operations involved in a generic HDC processor for learning and inference. A comparative study performed against a system-level design implemented using 65 nm CMOS technology showed that the in-memory HDC approach could result in more than 6× end-to-end savings in energy. By designing more energy-efficient peripheral circuits and with the potential of scaling PCM devices to nanoscale dimensions43, these gains could increase several-fold. The in-memory HDC concept is also applicable to other types of memristive device based on ionic drift44 and magnetoresistance45. Future work will focus on taking in-memory HDC beyond learning and classification to perform advanced cognitive tasks alongside data compression and retrieval on dense storage devices, as well as building more power-efficient peripheral hardware to harness the best of in-memory computing.

Methods
PCM-based hardware platform. The experimental hardware platform is built around a prototype PCM chip that contains PCM cells based on doped-Ge2Sb2Te5 (d-GST) that are integrated into the prototype chip in 90 nm CMOS baseline technology. In addition to the PCM cells, the prototype chip integrates the circuitry for cell addressing, on-chip ADCs for cell readout and voltage- or current-mode cell programming. The experimental platform comprises the following main units:

• a high-performance analog-front-end (AFE) board that contains digital-to-analog converters (DACs) along with discrete electronics, such as power supplies, voltage and current reference sources
• a field-programmable gate array (FPGA) board that implements the data acquisition and digital logic to interface with the PCM device under test and with all the electronics of the AFE board
• a second FPGA board with an embedded processor and Ethernet connection that implements the overall system control and data management as well as the interface with the host computer

The prototype chip46 contains three million PCM cells, as well as the CMOS circuitry to address, program and read out any of these three million cells. In the PCM devices used for experimentation, two 240-nm-wide access transistors were used in parallel per PCM element (cell size of 50 F²). The PCM array is organized as a matrix of 512 wordlines and 2,048 bitlines. The PCM cells were integrated into the chip in 90 nm CMOS technology using the keyhole process47. The bottom electrode had a radius of ~20 nm and length of ~65 nm. The phase-change material was ~100 nm thick and extended to the top electrode, whose radius was ~100 nm. The selection of one PCM cell was performed by serially addressing a wordline and a bitline. The addresses were decoded and drove the wordline driver and the bitline multiplexer. The single selected cell could be programmed by forcing a current through the bitline with a voltage-controlled current source. It could also be read by an 8 bit on-chip ADC. To read a PCM cell, the selected bitline was biased to a constant voltage of 300 mV by a voltage regulator via a voltage Vread generated via an off-chip DAC. The sensed current, Iread, was integrated by a capacitor, and the resulting voltage was then digitized by the on-chip 8 bit cyclic ADC. The total time of one read was 1 μs. To program a PCM cell, a voltage Vprog generated off chip was converted on chip into a programming current, Iprog. This current was then mirrored into the selected bitline for the desired duration of the programming pulse. The pulse used to program the PCM to the amorphous state (reset) was a box-type rectangular pulse with duration of 400 ns and amplitude of 450 μA. The pulse used to program the PCM to the crystalline state (set) was a ramp-down pulse with total duration of ~12 μs. The access-device gate voltage (wordline voltage) was kept high at 2.75 V during the programming pulses. These programming conditions were optimized to achieve the highest on/off ratio and to minimize device-to-device variability for binary storage.

Datasets to evaluate in-memory HDC. We targeted three highly relevant learning and classification tasks to evaluate the proposed in-memory HDC architecture. These tasks demand a generic programmable architecture to support different numbers of inputs, classes and data types, as shown in Extended Data Table 1. In the following, we describe these tasks that are used to benchmark the performance of in-memory HDC in terms of classification accuracy.

1. Language classification: in this task, HDC is applied to classify raw text composed of Latin characters into their respective language48. The training texts are taken from the Wortschatz Corpora49, where large numbers of sentences (about a million bytes of text) are available for 22 European languages. Another independent dataset, Europarl Parallel Corpus50, with 1,000 sentences per language, is used as the test dataset for the classification. The former database is used for training 22 prototype hypervectors for each of the languages while the latter is used to run inference on the trained HDC model. For subsequent simulations and experiments with the language dataset, we use dimensionality d = 10,000 and n-gram size n = 4. We use an IM of 27 symbols, representing the 26 letters of the Latin alphabet plus a whitespace character. Training is performed using the entire training dataset, containing a labelled text of 120,000–240,000 words per language. For inference, a query is composed of a single sentence of the test dataset, so, in total, 1,000 queries per language are used.

2. News classification: the news dataset comprises a database of Reuters news articles, subjected to a lightweight pre-processing step, covering eight different news genres51. The pre-processing step removes frequent 'stop' words and words with fewer than three letters. The training set has 5,400+ documents, while the testing set contains 2,100+ documents. For subsequent simulations and experiments with the news dataset, we use dimensionality d = 10,000 and n-gram size n = 5, as suggested in ref. 18. Similar to the language task, we use an IM of 27 symbols, representing the 26 letters of the Latin alphabet plus a whitespace character. Training is performed using the entire training dataset, where all labelled documents pertaining to the same class are merged into a single text. This merged text contains 8,000–200,000 words per class. For inference, a query is composed of a single document of the test dataset.

3. Hand gesture recognition from EMG signals: in this task, we focus on use of HDC in a smart prosthetic application, namely hand gesture recognition from a stream of EMG signals. A database52 that provides EMG samples recorded from four channels covering the forearm muscles is used for this benchmark. Each channel data is quantized into 22 intensity levels of electric potential. The sampling frequency of the EMG signal is 500 Hz. A label is provided for each time sample. The label varies from one to five corresponding to five classes of performed gestures. This dataset is used to train an HDC model to detect hand gestures of a single subject. For training on the
Extended Data Table 1 | Architecture configurations and hyperparameters used for the three different tasks
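Extended Data Table 1 itself is not reproduced in this excerpt. As a rough summary of the configurations stated in Methods for the text-based tasks, the snippet below collects the stated hyperparameters together with an assumed text-to-symbol mapping (the 26 Latin letters plus whitespace; how other characters are normalized is not specified here, so mapping them to whitespace is our own choice):

```python
import string

# Assumed normalization: lowercase Latin letters kept, everything else -> space.
ALPHABET = string.ascii_lowercase + " "

def to_symbols(text):
    """Map raw text to item-memory row indices in [0, 26]."""
    return [ALPHABET.index(ch) if ch in ALPHABET else 26 for ch in text.lower()]

# Hyperparameters stated in Methods (EMG task omitted; its input structure differs).
TASKS = {
    "language": {"d": 10_000, "n": 4, "im_symbols": 27, "classes": 22},
    "news":     {"d": 10_000, "n": 5, "im_symbols": 27, "classes": 8},
}

print(to_symbols("Hello, world")[:8], TASKS["language"])
```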