
Articles

https://doi.org/10.1038/s41928-020-0410-3

In-memory hyperdimensional computing


Geethan Karunaratne1,2, Manuel Le Gallo1, Giovanni Cherubini1, Luca Benini2, Abbas Rahimi2 ✉ and Abu Sebastian1 ✉

Hyperdimensional computing is an emerging computational framework that takes inspiration from attributes of neuronal circuits including hyperdimensionality, fully distributed holographic representation and (pseudo)randomness. When employed for machine learning tasks, such as learning and classification, the framework involves manipulation and comparison of large patterns within memory. A key attribute of hyperdimensional computing is its robustness to the imperfections associated with the computational substrates on which it is implemented. It is therefore particularly amenable to emerging non-von Neumann approaches such as in-memory computing, where the physical attributes of nanoscale memristive devices are exploited to perform computation. Here, we report a complete in-memory hyperdimensional computing system in which all operations are implemented on two memristive crossbar engines together with peripheral digital complementary metal–oxide–semiconductor (CMOS) circuits. Our approach can achieve a near-optimum trade-off between design complexity and classification accuracy based on three prototypical hyperdimensional computing-related learning tasks: language classification, news classification and hand gesture recognition from electromyography signals. Experiments using 760,000 phase-change memory devices performing analog in-memory computing achieve comparable accuracies to software implementations.

Biological computing systems trade accuracy for efficiency. Thus, one solution to reduce energy consumption in artificial systems is to adopt computational approaches that are inherently robust to uncertainty. Hyperdimensional computing (HDC) is one such framework and is based on the observation that key aspects of human memory, perception and cognition can be explained by the mathematical properties of hyperdimensional spaces comprising high-dimensional binary vectors known as hypervectors. Hypervectors are defined as d-dimensional (where d ≥ 1,000) (pseudo)random vectors with independent and identically distributed (i.i.d.) components1. When the dimensionality is in the thousands, a large number of quasi-orthogonal hypervectors exist. This allows HDC to combine such hypervectors into new hypervectors using well-defined vector space operations, defined such that the resulting hypervector is unique, and with the same dimension. A powerful system of computing can be built on the rich algebra of hypervectors2. Groups, rings and fields over hypervectors become the underlying computing structures with permutations, mappings and inverses as primitive computing operations.

In recent years, HDC has been employed in a range of applications, including machine learning, cognitive computing, robotics and traditional computing. It has shown significant promise in machine learning applications that involve temporal patterns, such as text classification3, biomedical signal processing4,5, multimodal sensor fusion6 and distributed sensors7,8. A key advantage is that the training algorithm in HDC works in one or few shots: that is, object categories are learned from one or few examples, and in a single pass over the training data as opposed to many iterations. In the highlighted machine learning applications, HDC has achieved similar or higher accuracy with fewer training examples compared to support vector machines (SVMs)4, extreme gradient boosting9 and convolutional neural networks (CNNs)10, and lower execution energy on embedded CPU/GPUs compared to SVMs11, CNNs and long short-term memory5. Applications of HDC in cognitive computing include solving Raven's progressive matrices12, functional imitation of concept learning in honey bees13 and analogical reasoning14. In the field of robotics, HDC has been employed for learning sensorimotor control for active perception in robots10. In traditional forms of computing, HDC has been proposed for efficient representation of structured information15 as well as the synthesis and execution of finite state automata16 and variants of recurrent neural networks17.

HDC begins by representing symbols with i.i.d. hypervectors that are combined by nearly i.i.d.-preserving operations, namely binding, bundling and permutation, and then stored in associative memories (AMs) to be recalled, matched, decomposed or reasoned about. This chain implies that failure in a component of a hypervector is not contagious and forms a computational framework that is intrinsically robust to defects, variations and noise18. The manipulation of large patterns stored in memory and its inherent robustness make HDC particularly well suited to emerging computing paradigms such as in-memory computing or computational memory based on emerging nanoscale resistive memory or memristive devices19–23. In one such work, a 3D vertical resistive random access memory (ReRAM) device was used to perform individual operations for HDC24,25. In another work, a carbon-nanotube field-effect transistor-based logic layer was integrated with ReRAMs, improving efficiency further26. However, these architectures offered only limited applications, such as a single language recognition task24,26 or a restricted binary classification version of the same task26, and their evaluation is based on simulations and compact models derived from small prototypes with only 256 ReRAM cells24 or a small 32 bit datapath for hypervector manipulations that results in three orders of magnitude higher latency overhead26.

In this Article, we report a complete integrated in-memory HDC system in which all the operations of HDC are implemented on two planar memristive crossbar engines together with peripheral digital CMOS circuits. We devise a way of performing hypervector binding entirely within a first memristive crossbar using an in-memory read logic operation and hypervector bundling near the crossbar with CMOS logic. These key operations of HDC cooperatively encode hypervectors with high precision, while eliminating the need to repeatedly program (write) the memristive devices.

1IBM Research – Zurich, Rüschlikon, Switzerland. 2Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland. ✉e-mail: [email protected]; [email protected]




In contrast, previous work on HDC using memristive devices did not employ in-memory logic operations for binding; instead, a ReRAM-based XOR lookup table24 or digital logic26 was used. Moreover, the previous in-memory compute primitives for permutation24 and bundling26 resulted in repeated programming of the memristive devices, which is prohibitive given the limited cycling endurance.

In our architecture, an AM search is performed using a second memristive crossbar for in-memory dot-product operations on the encoded output hypervectors from the first crossbar, realizing the full HDC system functionality. Our combination of analog in-memory computing with CMOS logic allows continual functioning of the memristive crossbars with desired accuracy for a wide range of multiclass classification tasks. We verify the integrated inference functionality of the system through large-scale mixed hardware/software experiments, in which up to 49 d = 10,000-dimensional hypervectors are encoded in 760,000 hardware phase-change memory (PCM) devices performing analog in-memory computing. Our experiments achieve comparable accuracies to the software baselines and surpass those reported in previous work on an emulated small ReRAM crossbar24. Furthermore, a complete system-level design of the in-memory HDC architecture synthesized using 65 nm CMOS technology demonstrates >6× end-to-end reductions in energy compared with a dedicated digital CMOS implementation. With our approach, we map all operations of HDC either in-memory or near-memory and demonstrate their integrated functionality for three specific machine learning related tasks.

The concept of in-memory HDC
When HDC is used for learning and classification, a set of i.i.d., hence quasi-orthogonal hypervectors, referred to as basis hypervectors, are first selected to represent each symbol associated with a dataset. For example, if the task is to classify an unknown text into the corresponding language, the symbols could be the letters of the alphabet. The basis hypervectors stay fixed throughout the computation. Assuming that there are h symbols, $\{s_i\}_1^h$, the set of the h, d-dimensional basis hypervectors $\{B_i\}_1^h$ is referred to as the item memory (IM) (Fig. 1). Basis hypervectors serve as the basis from which further representations are made by applying a well-defined set of component-wise operations: addition of binary hypervectors [+] is defined as the component-wise majority, multiplication (⊕) is defined as the component-wise exclusive-OR (XOR) (or equivalently as the component-wise exclusive-NOR (XNOR)) and permutation (ρ) is defined as a pseudo-random shuffling of the coordinates. Applied on dense binary hypervectors where each component has equal probability of being zero or one27, all these operations produce a d-bit hypervector, resulting in a closed system.

Subsequently, during the learning phase, the basis hypervectors in the IM are combined with the component-wise operations inside an encoder to compute, for example, a quasi-orthogonal n-gram hypervector representing an object of interest28, and to add n-gram hypervectors from the same category of objects to produce a prototype hypervector representing the entire class of category. In the language example, the encoder would receive input text associated with a known language and would generate a prototype hypervector corresponding to that language. In this case, n determines the smallest number of symbols (letters in the example) that are combined while performing an n-gram encoding operation. When the encoder receives n consecutive symbols, {s[1], s[2], …, s[n]}, it produces an n-gram hypervector through a binding operation given by

$$G(s[1], s[2], \ldots, s[n]) = B[1] \odot \rho(B[2]) \odot \cdots \odot \rho^{n-1}(B[n]) \qquad (1)$$

where B[k] corresponds to the associated basis hypervector for symbol s[k]. The operator $\odot$ denotes the XNOR and ρ denotes a pseudo-random permutation operation, for example, a circular shift by 1 bit. The encoder then bundles several such n-gram hypervectors from the training data using component-wise addition followed by a binarization (majority function) to produce a prototype hypervector for the given class. The overall encoding operation results in c, d-dimensional prototype hypervectors (referred to as associative memory (AM)), assuming there are c classes.

When inference or classification is performed, a query hypervector (for example, from a text of unknown language) is generated identically to the way the prototype hypervectors are generated. Subsequently, the query hypervector is compared with the prototype hypervectors inside the AM to make the appropriate classification. Equation (2) defines how a query hypervector Q is compared against each of the prototype hypervectors $P_i$ out of c classes to find the predicted class with maximum similarity. This AM search operation can, for example, be performed by calculating the inverse Hamming distance:

$$\mathrm{Class}_{\mathrm{Pred}} = \underset{i \in \{1,\ldots,c\}}{\arg\max} \; \sum_{j=1}^{d} Q(j) \odot P_i(j) \qquad (2)$$

One key observation is that the two main operations presented above, namely the encoding and AM search, are about manipulating and comparing large patterns within the memory. Both IM and AM (after learning) represent permanent hypervectors stored in the memory. As a lookup operation, different input symbols activate the corresponding stored patterns in the IM that are then combined inside or around memory with simple local operations to produce another pattern for comparison in AM. These component-wise arithmetic operations on patterns allow a high degree of parallelism, as each hypervector component needs to communicate with only a local component or its immediate neighbours. This highly memory-centric aspect of HDC is the key motivation for the in-memory computing implementation proposed in this work.

The essential idea of in-memory HDC is to store the components of both the IM and the AM as the conductance values of nanoscale memristive devices29,30 organized in crossbar arrays and enable HDC operations in or near to those devices (Fig. 1). The IM of h rows and d columns is stored in the first crossbar, where each basis hypervector is stored on a single row. To perform operations between the basis hypervectors for the n-gram encoding, an in-memory read logic primitive is employed. Unlike the majority of reported in-memory logic operations31–33, the proposed in-memory read logic is non-stateful and this obviates the need for high write endurance of the memristive devices. Additional peripheral circuitry is used to implement the remaining permutations and component-wise additions needed in the encoder. The AM of c rows and d columns is implemented in the second crossbar, where each prototype hypervector is stored on a single row. During supervised learning, each prototype hypervector output from the first crossbar is programmed into a certain row of the AM based on the provided label. During inference, the query hypervector output from the first crossbar is input as voltages on the wordline driver, to perform the AM search using an in-memory dot-product primitive. Because every memristive device in the AM and IM is reprogrammable, the representation of hypervectors is not hardcoded, unlike refs. 24–26, which used device variability for projection.

This design ideally fits the memory-centric architecture of HDC, because it allows us to perform the main computations on the IM and AM within the memory units with a high degree of parallelism. Furthermore, the IM and AM are only programmed once while training on a specific dataset, and the two types of in-memory computation that are employed involve just read operations. Therefore, non-volatile memristive devices are very well suited for implementing the IM and AM, and only binary conductance states are required.
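As a purely software reference for the operations described in this section, the sketch below builds an item memory, performs the n-gram binding of equation (1), bundles n-grams with the majority rule and compares a query against the prototypes as in equation (2). It is an illustrative NumPy sketch under simplifying assumptions (27 symbols, fixed random seed, simple tie handling in the majority vote); it is not the in-memory implementation reported here.

```python
# Software sketch of the basic HDC operations: binding via XNOR, permutation via
# circular shift, bundling via component-wise majority, AM search via inverse
# Hamming similarity. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, symbols = 10_000, list("abcdefghijklmnopqrstuvwxyz ")
item_memory = {s: rng.integers(0, 2, d, dtype=np.uint8) for s in symbols}  # IM

def xnor(a, b):
    return np.uint8(1) - (a ^ b)          # component-wise XNOR (binding)

def permute(v, k=1):
    return np.roll(v, k)                  # circular shift as the permutation rho

def ngram_hypervector(text_ngram):
    # Equation (1): G = B[1] XNOR rho(B[2]) XNOR ... XNOR rho^(n-1)(B[n])
    g = item_memory[text_ngram[0]]
    for k, s in enumerate(text_ngram[1:], start=1):
        g = xnor(g, permute(item_memory[s], k))
    return g

def encode(text, n=4):
    # Bundle all n-gram hypervectors of a text and binarize with the majority rule.
    grams = [ngram_hypervector(text[i:i + n]) for i in range(len(text) - n + 1)]
    return (np.sum(grams, axis=0) * 2 >= len(grams)).astype(np.uint8)

def train(labelled_texts, n=4):
    return {label: encode(text, n) for label, text in labelled_texts.items()}  # AM

def classify(query_text, assoc_memory, n=4):
    # Equation (2): pick the class whose prototype matches the query in the most
    # components (inverse Hamming distance).
    q = encode(query_text, n)
    return max(assoc_memory, key=lambda c: int(np.sum(xnor(q, assoc_memory[c]))))
```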



[Figure 1 schematic: an item memory crossbar (basis hypervectors B1–Bh addressed by input symbol s[k]) feeds an encoder with n-gram encoding and a bundler, producing query/prototype hypervectors; an associative memory crossbar (prototype hypervectors P1–Pc) with distance computation outputs the predicted class. Data paths are marked as inference-only, training-only and common; units are marked as memory, computing and in-memory computing.]
Fig. 1 | The concept of in-memory HDC. A schematic of the concept of in-memory HDC showing the essential steps associated with HDC (left) and how they are realized using in-memory computing (right). An item memory (IM) stores h, d-dimensional basis hypervectors that correspond to the symbols associated with a classification problem. During learning, based on a labelled training dataset, an encoder performs dimensionality-preserving mathematical manipulations on the basis hypervectors to produce c, d-dimensional prototype hypervectors that are stored in an AM. During classification, the same encoder generates a query hypervector based on a test example. Subsequently, an AM search is performed between the query hypervector and the hypervectors stored in the AM to determine the class to which the test example belongs. In in-memory HDC, both the IM and AM are mapped onto crossbar arrays of memristive devices. The mathematical operations associated with encoding and AM search are performed in place by exploiting in-memory read logic and dot-product operations, respectively. A dimensionality of d = 10,000 is used. SA, sense amplifier; AD converters, analog-to-digital converters.
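The in-memory dot-product operation mentioned in the caption can be made concrete with a toy crossbar model: stored bits become high or low conductances, the query becomes read voltages on the wordlines, and each bitline current is the conductance-weighted sum given by Ohm's and Kirchhoff's laws. The conductance and voltage values below are placeholders for illustration, not measured parameters of the prototype chip.

```python
# Toy model of a memristive crossbar computing dot products: bitline current is
# proportional to the dot product between the applied query and the stored column.
import numpy as np

G_SET, G_RESET, V_READ = 20e-6, 0.0, 0.3      # ~20 uS crystalline, ~0 uS amorphous, 300 mV

def bitline_currents(stored_bits, query_bits):
    G = np.where(stored_bits == 1, G_SET, G_RESET)   # (rows x columns) conductances
    V = query_bits * V_READ                          # one read voltage per wordline
    return V @ G                                     # Kirchhoff sum per bitline (column)

rng = np.random.default_rng(4)
prototypes = rng.integers(0, 2, (10_000, 8))         # d rows, one class per column
query = rng.integers(0, 2, 10_000)
I = bitline_currents(prototypes, query)              # proportional to the dot products
assert np.argmax(I) == np.argmax(query @ prototypes)
```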

In this work, we have used PCM technology34,35, which operates by switching a phase-change material between amorphous (high resistivity) and crystalline (low resistivity) phases to implement binary data storage (see Methods). PCM has also been successfully employed in novel computing paradigms such as neuromorphic computing36–40 and computational memory20,22,41,42, which makes it a good candidate for realizing the in-memory HDC system.

In the remaining part of this Article, we will elaborate the detailed designs of the AM, the encoder and finally propose a complete in-memory HDC system that achieves a near-optimum trade-off between design complexity and output accuracy. The functionality of the in-memory HDC system will be validated through experiments using a prototype PCM chip fabricated in 90 nm CMOS technology (see Methods), and a complete system-level design implemented using 65 nm CMOS technology will be presented.

The AM search module
Classification involves an AM search between the prototype hypervectors and the query hypervector using a suitable similarity metric, such as the inverse Hamming distance (invHamm) computed from equation (2). Using associativity of addition operations, the expression in equation (2) can be decomposed into the addition of two dot-product terms as shown in equation (3):

$$\mathrm{Class}_{\mathrm{Pred}} = \underset{i \in \{1,\ldots,c\}}{\arg\max} \left( Q \cdot P_i + \overline{Q} \cdot \overline{P_i} \right) \simeq \underset{i \in \{1,\ldots,c\}}{\arg\max} \; Q \cdot P_i \qquad (3)$$

where $\overline{Q}$ denotes the logical complement of Q. Because the operations associated with HDC ensure that both the query and prototype hypervectors have an almost equal number of zeros and ones, the dot product (dotp) $\arg\max_{i \in \{1,\ldots,c\}} Q \cdot P_i$ can also serve as a viable similarity metric.



[Figure 2 panels: a, AM search architecture with two crossbars storing Pi and their complements, AD converters, adder array and WTA circuit; b, coarse-grained randomization of prototype segments across partitions r1–rf addressed by query segments Q1–Qf; c, bar chart of classification accuracy for the dotp and invHamm metrics on the language, news and EMG datasets (simulations with f = 1, 2 and 10, PCM chip experiment with f = 10, and software baseline).]

Fig. 2 | AM search. a, Schematic of the AM search architecture to compute the invHamm similarity metric. Two PCM crossbar arrays of c rows and d columns are employed. b, Schematic of the coarse-grained randomization strategy employed to counter the variations associated with the crystalline PCM state. c, Results of the classification task show that the experimental on-chip accuracy results compare favourably with the 10-partition simulation results and software baseline for both similarity metrics on the three datasets.

To compute the invHamm similarity metric, two memristive crossbar arrays of c rows and d columns are required, as shown in Fig. 2a. The prototype hypervectors, Pi, are programmed into one of the crossbar arrays as conductance states. Binary '1' components are programmed as crystalline states and binary '0' components are programmed as amorphous states. The complementary hypervectors $\overline{P_i}$ are programmed in a similar manner into the second crossbar array. The query hypervector Q and its complement $\overline{Q}$ are applied as voltage values along the wordlines of the respective crossbars. In accordance with Kirchhoff's current law, the total current on the ith bitline will be equal to the dot product between the query hypervector and the ith prototype hypervector. The results of these in-memory dot-product operations from the two arrays are added in a pairwise manner using a digital adder circuitry in the periphery and are subsequently input to a winner-take-all (WTA) circuit that outputs a '1' only on the bitline corresponding to the class of maximum similarity value. When the dotp similarity metric is considered, only the crossbar encoding Pi is used and the array of adders in the periphery is eliminated, resulting in reduced hardware complexity.

Experiments were performed using a prototype PCM chip to evaluate the effectiveness of the proposed implementation on three common HDC benchmarks: language classification, news classification and hand gesture recognition from electromyography (EMG) signals (see Methods). These tasks demand a generic programmable architecture to support different numbers of inputs, classes and data types (see Methods). In the experiments, the prototype hypervectors (and their complements) are learned beforehand in software and are then programmed into the PCM devices on the chip. Inference is then performed with a software encoder and using equation (3) for the AM search, in which all multiplication operations are performed in the analog domain (by exploiting Ohm's law) on chip and the remaining operations are implemented in software (see Methods and Supplementary Note 1). The software encoder was employed to precisely assess the performance and accuracy of the AM search alone when implemented in hardware. The in-memory encoding scheme and its experimental validation are presented in sections 'The n-gram encoding module' and 'The complete in-memory HDC system'.

Although HDC is remarkably robust to random variability and device failures, deterministic spatial variations in the conductance values could pose a challenge. Unfortunately, in our prototype PCM chip, the conductance values associated with the crystalline state do exhibit a deterministic spatial variation (Supplementary Note 2). However, given the holographic nature of the hypervectors, this can be easily addressed by a random partitioning approach. We employed a coarse-grained randomization strategy, where the idea is to segment the prototype hypervector and to place the resulting segments spatially distributed across the crossbar array (Fig. 2b).
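The coarse-grained randomization just described can be sketched as follows, assuming a partition factor f = 10, hypothetical random prototypes and a single random class order reused for every partition; the actual programming order and chip addressing are given in the Methods.

```python
# Sketch of coarse-grained partitioning: each prototype is cut into f segments
# spread over f crossbar partitions, and the query is applied segment by segment,
# accumulating partial dot products per class. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
d, c, f = 10_000, 8, 10
seg = d // f
P = rng.integers(0, 2, (c, d))                      # prototypes (software reference)
order = rng.permutation(c)                          # random class order e

# Partition r_i holds the i-th segment of every prototype, one segment per column.
partitions = [P[order, i * seg:(i + 1) * seg].T for i in range(f)]   # each (seg, c)

def am_search(Q):
    sums = np.zeros(c)
    for i in range(f):                              # feed the query segment by segment
        q_i = Q[i * seg:(i + 1) * seg]
        sums += q_i @ partitions[i]                 # partial dot products on partition r_i
    return order[np.argmax(sums)]                   # undo the column permutation

Q = P[5] ^ (rng.random(d) < 0.1)                    # noisy query drawn from class 5
assert am_search(Q) == 5
```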



This helps all the components of prototype hypervectors to uniformly mitigate long-range variations. The proposed strategy involves dividing the crossbar array into f equal sized partitions (r1, r2, …, rf) and storing a 1/f segment of each of the prototype hypervectors (P1, P2, ..., Pc) per partition. Here, f is called the 'partition factor' and it controls the granularity associated with the randomization. To match the segments of prototype hypervectors, the query vector is also split into equally sized subvectors Q1, Q2, ..., Qf, which are input sequentially to the wordline drivers of the crossbar.

A statistical model that captures the spatiotemporal conductivity variations was used to evaluate the effectiveness of the coarse-grained randomized partitioning method (Supplementary Note 2). Simulations were carried out for different partition factors 1, 2 and 10 for the two similarity metrics dotp and invHamm, as shown in Fig. 2c. These results indicate that the classification accuracy increases with the number of partitions. For example, for language classification, the accuracy improves from 82.5% to 96% with dotp by randomizing with a partition factor of 10 instead of 1. The experimental on-chip accuracy (performed with a partition factor of 10) is close to the 10-partition simulation result and the software baseline for both similarity metrics on all three datasets. When the two similarity metrics are compared, invHamm provides slightly better accuracy for the same partition size, at the expense of almost doubled area and energy consumption. Therefore, for low-power applications, a good trade-off is the use of the dotp similarity metric with a partition factor of 10.

The n-gram encoding module
In this section, we will focus on the design of the n-gram encoding module. As described in the section 'The concept of in-memory HDC', one of the key operations associated with the encoder is calculation of the n-gram hypervector G given by equation (1). To find in-memory hardware-friendly operations, equation (1) is rewritten as the component-wise summation of $2^{n-1}$ minterms given by equation (4):

$$G = \bigvee_{j=0}^{2^{n-1}-1} L_{1,j}(B[1]) \wedge \rho\!\left(L_{2,j}(B[2])\right) \wedge \cdots \wedge \rho^{n-1}\!\left(L_{n,j}(B[n])\right) \qquad (4)$$

The operator $L_{k,j}$ is given by

$$L_{k,j}(B[k]) = \begin{cases} B[k] & \text{if } (-1)^{Z(k,j)} = 1 \\ \overline{B[k]} & \text{otherwise} \end{cases}$$

where $Z(k,j) = \left\lfloor \frac{1}{2^{k}}\left(2j + 2^{k-1}\right) \right\rfloor$, k ∈ {1, 2, …, n} is the item hypervector index within an n-gram and j ∈ {0, 1, …, $2^{n-1}-1$} is used to index minterms. The representation given by equation (4) can be mapped into memristive crossbar arrays where the bitwise AND (∧) function can be realized using an in-memory read logic operation. However, the number of minterms ($2^{n-1}$) rises exponentially with the size n of the n-gram, making the hardware computations costly. It is thus desirable to reduce the number of minterms and to use a fixed number of minterms independent of n.

Based on equation (4), we empirically obtained a 2-minterm encoding function for calculating the n-gram hypervector given by

$$\hat{G} = \left( B[1] \wedge \rho(B[2]) \wedge \cdots \wedge \rho^{n-1}(B[n]) \right) \vee \left( \overline{B[1]} \wedge \rho(\overline{B[2]}) \wedge \cdots \wedge \rho^{n-1}(\overline{B[n]}) \right) \qquad (5)$$

Encoding based on $\hat{G}$ shows mostly functional equivalence with the ideal XNOR-based encoding scheme in certain key attributes, such as similarity between the basis and prototype hypervectors (Supplementary Note 3). A schematic illustration of the corresponding n-gram encoding system is presented in Fig. 3a. The basis hypervectors are programmed on one of the crossbars and their complement vectors are programmed on the second. The component-wise logical AND operation between two hypervectors in equation (5) is realized in-memory by applying one of the hypervectors as the gate control lines of the crossbar, while selecting the wordline of the second hypervector. The result of the AND function from the crossbar is passed through an array of sense amplifiers to convert the analog values to binary values. The binary result is then stored in the minterm buffer, whose output is fed back as the gate controls by a single component shift to the right (left in the complementary crossbar). This operation approximates the permutation operation in equation (5) as a 1 bit right shift instead of a circular 1 bit shift. By performing these operations n times, it is possible to generate the n-gram. After n-gram encoding, the generated n-grams are accumulated and binarized with a threshold that depends on n (for details see Methods).

To test the effectiveness of the encoding scheme with in-memory computing, simulations were carried out using the PCM statistical model. The training was performed in software with the same encoding technique used thereafter for inference, and both the encoder and AM were implemented with modelled PCM crossbars for inference. The simulations were performed only on the language and news classification datasets, because for the EMG dataset the hypervectors used for the n-gram encoding are generated by a spatial encoding process and cannot be mapped entirely into a fixed IM of reasonable size. From the results presented in Fig. 3b, it is clear that the all-minterm approach to encoding provides the best classification accuracy in most configurations of AM, as expected. However, the 2-minterm-based encoding method yields a stable and, in some cases, particularly in the language dataset, a similar accuracy level to that of the all-minterm approach, while significantly reducing the hardware complexity. One of the perceived drawbacks of the 2-minterm approach is the increasing sparsity of the n-gram hypervectors with n. However, it can be shown that the dot-product similarity between the prototype hypervectors and hence the classification accuracy remain relatively unchanged due to the thresholding operation that depends on n (Supplementary Note 4).



[Figure 3 panels: a, the original and complementary IM crossbars (B1–Bh and their complements) with sense amplifiers and minterm buffers that generate the n-gram hypervector G; b, bar chart comparing 2-minterm and all-minterm encoding accuracy for the dotp and invHamm metrics on the language and news datasets.]

Fig. 3 | In-memory n-gram encoding based on 2-minterm. a, The basis hypervectors and their complements are mapped onto two crossbar arrays. Through a sequence of in-memory logical operations, the approximated n-gram G, as in equation (5), is generated. b, Classification results on the language (using n = 4) and news (using n = 5) datasets, showing the performance of the 2-minterm approximation compared with the all-minterm approach.
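A software sketch of the 2-minterm encoding of equation (5) and of the bundling threshold from the Methods, assuming the 1-bit (non-circular) right shift described above, n = 4, m = 2 and the logarithm taken to base 2; the 200 random n-grams stand in for a training text and are not real data.

```python
# Illustrative 2-minterm n-gram encoding (equation (5)) with shift-based permutation
# and threshold-based bundling. Not the authors' hardware implementation.
import numpy as np

rng = np.random.default_rng(3)
d, n, m = 10_000, 4, 2
B = rng.integers(0, 2, (27, d), dtype=np.uint8)        # item memory, 27 symbols

def shift1(v):                                          # 1-bit right shift, zero fill
    out = np.zeros_like(v)
    out[1:] = v[:-1]
    return out

def two_minterm_ngram(idx):                             # idx: indices of the n symbols
    t, tc = B[idx[0]], 1 - B[idx[0]]                    # minterms over originals / complements
    for k, i in enumerate(idx[1:], start=1):
        r, rc = B[i], 1 - B[i]
        for _ in range(k):                              # rho^k approximated by k right shifts
            r, rc = shift1(r), shift1(rc)
        t, tc = t & r, tc & rc
    return t | tc                                       # equation (5)

def bundle(ngrams):                                     # accumulate and binarize
    s = np.sum(ngrams, axis=0)
    threshold = len(ngrams) * (0.5 ** (n - np.log2(m)))  # l * (1/2)^(n - log m), log base 2 assumed
    return (s >= threshold).astype(np.uint8)

grams = [two_minterm_ngram(rng.integers(0, 27, size=n)) for _ in range(200)]
prototype = bundle(grams)                               # prototype (or query) hypervector
```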

The complete in-memory HDC system
In this section, the complete HDC system and the associated experimental results are presented. The proposed architecture comprises the 2-minterm encoder and dotp similarity metric with a partition factor of 10, as this provides the best trade-off between classification accuracy and hardware complexity (Supplementary Note 3). As shown in Fig. 4a, the proposed architecture has three PCM crossbar arrays—two with h rows and d columns and one with c × f rows and d/f columns, with f = 10.

The system includes several peripheral circuits—an index buffer, a minterm buffer and a bundler that reside inside the encoder—while the AM search module contains a sum buffer and a comparator circuit. The index buffer is located at the input of the IM to keep the indices of the symbols in the sequence and to feed them into the crossbar rows. The bundler accumulates the n-gram hypervectors to produce a sum hypervector. Once the threshold is applied on the sum hypervector, the result is a prototype hypervector during training or a query hypervector during inference. The controller inside the encoder module generates control signals according to the n-gram size and the length of the query sequence to allow different configurations of the encoder. During inference, one segment of the query hypervector at the output buffer of the encoder is fed at a time to the AM through an array of multiplexers so that only the corresponding partition is activated in the AM. Depending on the partition that is selected, the relevant gates are activated through a controller sitting inside the AM search module. Finally, the results in the sum buffer are sent through WTA circuitry to find the maximum index that provides the prediction.

To experimentally validate the functionality of the complete in-memory HDC architecture, we chose to implement the inference operation, which comprises both encoding (to generate the query hypervectors) and AM search (Supplementary Video 1). For faster experiments, we trained our HDC model in software using the 2-minterm approximate encoding method described in the section 'The n-gram encoding module', which could be performed as well with our proposed in-memory HDC architecture. This software generates the hypervectors for AM from a given dataset. Subsequently, the components of all hypervectors of both IM and AM were programmed on individual hardware PCM devices, and the inference operation was implemented leveraging the two in-memory computing primitives (for both 2-minterm encoding and the AM search) using the prototype PCM chip (see Methods and Supplementary Note 1). Figure 4b summarizes the accuracy results with software, the PCM statistical model and the on-chip experiment for the language and news classification benchmarks. Compared with the previous experiment, where only AM was contained on-chip, the full chip experiment results show a similar accuracy level, indicating the minimal effect on accuracy when porting the IM into PCM devices with in-memory n-gram encoding. Furthermore, the accuracy level reported in this experiment is close to the accuracy reported with the software for the same parametric configuration of the HD inference model.

Finally, to benchmark the performance of the system in terms of energy consumption, the digital submodules in the system-level architecture (marked with dotted boundaries in Fig. 4a) that fall outside the PCM crossbar arrays were synthesized using 65 nm CMOS technology. The synthesis results for these modules were combined with the performance characteristics of PCM crossbar arrays to evaluate the energy, area and throughput of the full system (see Methods). Furthermore, PCM crossbar sections were implemented in CMOS distributed standard cell registers with associated multiplier–adder tree logic and binding logic for AM and IM, respectively, to construct a complete CMOS HD processor to compare with the proposed PCM crossbar-based architecture.

A comparison of the performance between the all-CMOS approach and the PCM crossbar-based approach is presented in Table 1. A 6.01× improvement in total energy efficiency and 3.74× reduction in area is obtained with the introduction of the PCM crossbar modules. The encoder's energy expense for processing a query reduces by a factor of 3.50 with the PCM crossbar implementation, whereas that of the AM search module reduces by a factor of 117.5. However, these efficiency factors are partially masked by the CMOS peripheral circuitry that is common to both implementations, specifically that in the encoder module, which accounts for the majority of its energy consumption. When peripheral circuits are ignored and only the parts of the design that are exclusive to each approach are directly compared to each other, 14.4× and 334× energy savings and 24.5× and 31.9× area savings are obtained for the encoder and AM search module, respectively. It remains part of future work to investigate methods in which peripheral modules are designed more energy efficiently so that the overall system efficiency can be improved further.



[Figure 4a schematic: encoder with index buffer, original and complementary IM crossbars, sense amplifiers, minterm buffers, bundler and controller; AM search module with AM crossbar partitions r1–rf, AD converters, sum buffer, WTA and partition-select control, producing the predicted class index. Figure 4b: bar chart of classification accuracy.]

Fig. 4 | The complete in-memory HDC system. a, Schematic of the architecture, showing the 2-minterm encoder and AM search engine employing the dotp metric. b, The classification accuracy results on the news and language datasets, where both the encoding and AM search are performed in software, simulated using the PCM model and are experimentally realized on the chip.

Conclusions
HDC is a brain-inspired computational framework that is particularly well suited for the emerging computational paradigm of in-memory computing. We have reported a complete in-memory HDC system whose two main components are an encoder and an AM search engine. The main computations are performed in-memory with logical and dot-product operations on memristive devices. Due to the inherent robustness of HDC to errors, it was possible to approximate the mathematical operations associated with HDC to make it suitable for hardware implementation, and to use analog in-memory computing without significantly degrading the output accuracy. Our architecture is programmable to support different hypervector representations, dimensionality and number of input symbols and output classes to accommodate a variety of applications.




Table 1 | Performance comparison between a dedicated all-CMOS implementation and in-memory HDC with PCM crossbars

| Metric | All-CMOS: Encoder | All-CMOS: AM search | All-CMOS: Total | PCM crossbar: Encoder | PCM crossbar: AM search | PCM crossbar: Total |
|---|---|---|---|---|---|---|
| Energy: average energy per query (nJ) | 1,474 | 1,110 | 2,584 | 420.8 | 9.44 | 430.3 |
| Improvement (vs all-CMOS) | – | – | – | 3.50× | 117.5× | 6.01× |
| Energy: exclusive modules avg. energy per query (nJ) | 1,132 | 1,104 | 2,236 | 78.60 | 3.30 | 81.90 |
| Improvement (vs all-CMOS) | – | – | – | 14.40× | 334.6× | 27.30× |
| Area: total area (mm²) | 4.77 | 2.99 | 7.76 | 1.39 | 0.68 | 2.07 |
| Improvement (vs all-CMOS) | – | – | – | 3.43× | 4.38× | 3.74× |
| Area: exclusive modules area (mm²) | 3.53 | 2.38 | 5.91 | 0.14 | 0.075 | 0.22 |
| Improvement (vs all-CMOS) | – | – | – | 24.57× | 31.94× | 27.09× |
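As a quick consistency check, the improvement rows of Table 1 follow (up to rounding) from dividing each all-CMOS entry by the corresponding PCM-crossbar entry:

```python
# Arithmetic check of the improvement factors reported in Table 1.
cmos = {"encoder": 1474.0, "am_search": 1110.0, "total": 2584.0}   # nJ per query
pcm  = {"encoder": 420.8,  "am_search": 9.44,   "total": 430.3}

for k in cmos:
    print(k, round(cmos[k] / pcm[k], 2))   # ~3.50x, ~117.6x, ~6.01x

print(round(7.76 / 2.07, 2))               # total area improvement, ~3.75x (Table: 3.74x)
```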

Hardware/software experiments using a prototype PCM chip delivered accuracies comparable to software baselines on language and news classification benchmarks with 10,000-dimensional hypervectors. These experiments used hardware PCM devices to implement both in-memory encoding and AM search, thus demonstrating the hardware functionality of all the operations involved in a generic HDC processor for learning and inference. A comparative study performed against a system-level design implemented using 65 nm CMOS technology showed that the in-memory HDC approach could result in more than 6× end-to-end savings in energy. By designing more energy-efficient peripheral circuits and with the potential of scaling PCM devices to nanoscale dimensions43, these gains could increase several-fold. The in-memory HDC concept is also applicable to other types of memristive device based on ionic drift44 and magnetoresistance45. Future work will focus on taking in-memory HDC beyond learning and classification to perform advanced cognitive tasks alongside data compression and retrieval on dense storage devices, as well as building more power-efficient peripheral hardware to harness the best of in-memory computing.

Methods
PCM-based hardware platform. The experimental hardware platform is built around a prototype PCM chip that contains PCM cells based on doped-Ge2Sb2Te5 (d-GST) that are integrated into the prototype chip in 90 nm CMOS baseline technology. In addition to the PCM cells, the prototype chip integrates the circuitry for cell addressing, on-chip ADCs for cell readout and voltage- or current-mode cell programming. The experimental platform comprises the following main units:

• a high-performance analog-front-end (AFE) board that contains digital-to-analog converters (DACs) along with discrete electronics, such as power supplies, voltage and current reference sources
• a field-programmable gate array (FPGA) board that implements the data acquisition and digital logic to interface with the PCM device under test and with all the electronics of the AFE board
• a second FPGA board with an embedded processor and Ethernet connection that implements the overall system control and data management as well as the interface with the host computer

The prototype chip46 contains three million PCM cells, as well as the CMOS circuitry to address, program and read out any of these three million cells. In the PCM devices used for experimentation, two 240-nm-wide access transistors were used in parallel per PCM element (cell size of 50 F²). The PCM array is organized as a matrix of 512 wordlines and 2,048 bitlines. The PCM cells were integrated into the chip in 90 nm CMOS technology using the keyhole process47. The bottom electrode had a radius of ~20 nm and length of ~65 nm. The phase-change material was ~100 nm thick and extended to the top electrode, whose radius was ~100 nm. The selection of one PCM cell was performed by serially addressing a wordline and a bitline. The addresses were decoded and drove the wordline driver and the bitline multiplexer. The single selected cell could be programmed by forcing a current through the bitline with a voltage-controlled current source. It could also be read by an 8 bit on-chip ADC. To read a PCM cell, the selected bitline was biased to a constant voltage of 300 mV by a voltage regulator via a voltage Vread generated via an off-chip DAC. The sensed current, Iread, was integrated by a capacitor, and the resulting voltage was then digitized by the on-chip 8 bit cyclic ADC. The total time of one read was 1 μs. To program a PCM cell, a voltage Vprog generated off chip was converted on chip into a programming current, Iprog. This current was then mirrored into the selected bitline for the desired duration of the programming pulse. The pulse used to program the PCM to the amorphous state (reset) was a box-type rectangular pulse with duration of 400 ns and amplitude of 450 μA. The pulse used to program the PCM to the crystalline state (set) was a ramp-down pulse with total duration of ~12 μs. The access-device gate voltage (wordline voltage) was kept high at 2.75 V during the programming pulses. These programming conditions were optimized to achieve the highest on/off ratio and to minimize device-to-device variability for binary storage.

Datasets to evaluate in-memory HDC. We targeted three highly relevant learning and classification tasks to evaluate the proposed in-memory HDC architecture. These tasks demand a generic programmable architecture to support different numbers of inputs, classes and data types, as shown in Extended Data Table 1. In the following, we describe these tasks that are used to benchmark the performance of in-memory HDC in terms of classification accuracy.

1. Language classification: in this task, HDC is applied to classify raw text composed of Latin characters into their respective language48. The training texts are taken from the Wortschatz Corpora49, where large numbers of sentences (about a million bytes of text) are available for 22 European languages. Another independent dataset, Europarl Parallel Corpus50, with 1,000 sentences per language, is used as the test dataset for the classification. The former database is used for training 22 prototype hypervectors, one for each of the languages, while the latter is used to run inference on the trained HDC model. For subsequent simulations and experiments with the language dataset, we use dimensionality d = 10,000 and n-gram size n = 4. We use an IM of 27 symbols, representing the 26 letters of the Latin alphabet plus a whitespace character. Training is performed using the entire training dataset, containing a labelled text of 120,000–240,000 words per language. For inference, a query is composed of a single sentence of the test dataset, so, in total, 1,000 queries per language are used.

2. News classification: the news dataset comprises a database of Reuters news articles, subjected to a lightweight pre-processing step, covering eight different news genres51. The pre-processing step removes frequent 'stop' words and words with fewer than three letters. The training set has 5,400+ documents, while the testing set contains 2,100+ documents. For subsequent simulations and experiments with the news dataset, we use dimensionality d = 10,000 and n-gram size n = 5, as suggested in ref. 18. Similar to the language task, we use an IM of 27 symbols, representing the 26 letters of the Latin alphabet plus a whitespace character. Training is performed using the entire training dataset, where all labelled documents pertaining to the same class are merged into a single text. This merged text contains 8,000–200,000 words per class. For inference, a query is composed of a single document of the test dataset.

3. Hand gesture recognition from EMG signals: in this task, we focus on use of HDC in a smart prosthetic application, namely hand gesture recognition from a stream of EMG signals. A database52 that provides EMG samples recorded from four channels covering the forearm muscles is used for this benchmark. Data from each channel are quantized into 22 intensity levels of electric potential. The sampling frequency of the EMG signal is 500 Hz. A label is provided for each time sample. The label varies from one to five, corresponding to five classes of performed gestures. This dataset is used to train an HDC model to detect hand gestures of a single subject.


EMG dataset, a spatial encoding scheme is first employed to fuse data from search module’, the query hypervectors are generated using the same software HD
the four channels so the IM has four discrete symbols, and it is paired with encoder used for training. In the experiments of section ‘The complete in-memory
a continuous IM to jointly map the 22 intensity levels per channel (details HDC system’, the query hypervectors are generated with in-memory encoding
about the encoding procedure for the EMG dataset are provided in Supple- using the prototype PCM chip as described in the section ‘Experiments on the
mentary Note 5). The pairing of IM and continuous IM allows a combination complete in-memory HDC system’.
of orthogonal mapping with distance proportionality mapping. The spatial The AM search on a given query hypervector is performed using the prototype
encoding creates one hypervector per time sample. PCM chip as follows. The components of the query hypervector carrying a value
A temporal encoding step is then performed, whereby n consecutive spatially 1 trigger a read (300 mV applied voltage) on the devices storing the corresponding
encoded hypervectors are combined into an n-gram. For the subsequent components of prototype hypervectors, thus realizing the analog multiplications
simulations and experiments with the EMG dataset we use dimensional- through Ohm’s law of the in-memory dot-product operation. The same procedure
ity d = 10,000 and n-gram size n = 5. Training and inference are performed is performed with the complementary query hypervector on the devices storing
using the same EMG channel signals from the same subject, but on complementary prototype hypervectors. The resulting current values are digitized
non-overlapping sections of recording. The recording used for training con- via the on-chip ADC, transferred to the host computer and classwise summed up
tains 1,280 time samples after downsampling by a factor of 175. For inference, in software according to the predetermined partition order to obtain classwise
780 queries are generated from the rest of the recording, where each query similarity values (Supplementary Note 1). The class with the highest similarity
contains five time samples captured with the same downsampling factor. is assigned as the predicted class for the given query. For experiments with dotp
as the similarity metric, the devices attributed to complementary prototype
For the different tasks, Extended Data Table 1 provides details on the desired
hypervectors are not read when forming the classwise aggregate.
hypervector representations and different hyperparameters including the dimension
of hypervectors, the alphabet size, the n-gram size and the number of classes. For
More details on the 2-minterm encoder. To generate an n-gram hypervector in
the EMG dataset, the hypervectors for the encoding operation are drawn by binding
n cycles, the crossbar is operated using the following procedure. During the first
items from a pair of IM and continuous IM (Supplementary Note 5). In hardware
cycle, n-gram encoding is initiated by asserting the ‘start’ signal while choosing
implementation of the in-memory HDC, the IM and AM may be distributed into
the index of the nth symbol s[n]. This enables all the gate lines in both crossbar
multiple narrower crossbars in case electrical/physical limitations arise.
arrays and the wordline corresponding to s[n] to be activated. The current released
onto the bitlines passed through the sense amplifiers should ideally match the
Coarse-grained randomization. The programming methodology followed to
logic levels of B[n] in first array and B½n in the second array. The two ‘minterm
achieve the coarse-grained randomized partitioning in the memristive crossbar
buffers’ downstream of the sense amplifier
I arrays register the two hypervectors
for the AM search is explained in the following steps. First, we split all prototype
by the end of the first cycle. During subsequent jth (1 < j ≤ n) cycles, the gate lines
hypervectors (P1, P2,..., Pc) into f subvectors of equal length, where f is the partition
are driven by the right-shifted version of the incumbent values on the minterm
factor. For example, subvectors from the prototype hypervector of the first class
f buffers—effectively implementing permutation—while row decoders are fed with
are denoted as ( P11 , P21 , ..., P1 ). The crossbar array is then divided into f equally
symbol s[n − j + 1] (the left shift is used for the second crossbar). This ensures
sized partitions (rI 1, rI2, ..., rf).I Each partition must contain d/f rows and c columns.
that the output currents on the bitlines correspond to the component-wise
A random permutation e of numbers 1 to c is then selected. Next, the first
logical AND between the permuted minterm buffer values and the next basis
subvector from each class ( P11 , P12 , ..., P1c ) is programmed into the first partition r1
hypervector B[n − j] (complement for the second array). The expression for
such that each subvector fitsIto aI column I in the crossbar partition. The order of
theQvalue stored on the left-side minterm buffers at the end of jth cycle is given
programming of subvectors into the columns in the partition is determined by the j
by k¼1 ρj�k B½n � k þ 1. The product of the complementary hypervectors
previously selected random permutation e. The above steps must be repeated to Qj I j�k
program all the remaining partitions (r2, r3, ..., rf). k¼1 ρ B½n � k þ 1 is stored in the right-side minterm buffers. At the end of
The methodology followed in feeding query vectors during inference is I nth cycle, the two minterms are available in the minterm buffers. The elements
the
detailed in the following steps. First, we split query hypervector Q into f subvectors in the minterm buffers are passed onto the OR gate array following the minterm
(Q1, Q2,...,Qf) of equal length. We then translate Qi component values into voltage buffers (shown in Fig. 3a), such that inputs to the array have matching indices from
levels and apply them onto the wordline drivers in the crossbar array. Bitlines the two minterm vectors. At this point, the output of the OR gate array reflects
corresponding to the partition ri are enabled. Depending on the belonging class, the desired n-gram hypervector from 2-minterm n-gram encoding. After n-gram
the partial dot products are then collected onto the respective destination in a encoding, the generated n-grams are accumulated and binarized. In the hardware
sum buffer through AD converters at the end of the ri partition of the array. This implementation, this step is realized inside the bundler module shown in Fig. 4a.
procedure is repeated for each partition ri. Classwise partial dot products are The threshold applied to binarize the sum hypervector components is given by
accumulated together in each iteration and updated in the sum buffer. After the fth  
1
iteration, full dot-product values are ready in the sum buffer. The results are then l ´ n�logðmÞ
compared against each other using a WTA circuit to find the maximum value to 2
assign its index as the predicted class. where l is the length of the sequence, n is the n-gram size and m is the number of
minterms used for the binding operation in the encoder (for example, m = 2 for
Experiments on AM search. To obtain the prototype hypervectors used for the 2-minterm encoder).
AM search, training with HDC is first performed in software on the three datasets
described in the section ‘Datasets to evaluate in-memory HDC’. For the language Experiments on the complete in-memory HDC system. For the experiments
and news datasets, XOR-based encoding (see section ‘The concept of in-memory concerning the complete in-memory HDC system, training with HDC is first
HDC’) is used with an n-gram size of n = 4 and n = 5, respectively. For the EMG performed in software on the language and news datasets. 2-minterm encoding
dataset, an initial spatial encoding step creates one hypervector per time sample. (equation (5)) is used with n-gram sizes of n = 4 and n = 5, respectively.
A temporal encoding step is then performed, whereby n consecutive spatially After training is performed, h × d × 2 devices are allocated on the PCM chip for
encoded hypervectors are combined into an n-gram with XOR-based encoding storing IM and complementary IM in addition to d × c devices allocated for AM.
and n = 5. The detailed encoding procedure for the EMG dataset is explained in The IM and complementary IM hypervectors are programmed on PCM devices
Supplementary Note 5. in a single shot with reset/set pulses for logical 0/1 components. The prototype
Once training is performed, the prototype hypervectors are programmed on the prototype PCM chip. In the experiment conducted with invHamm as the similarity metric, d × c × 2 devices on the PCM prototype chip are allocated. Each device in the first half of the address range (from 1 to d × c) is programmed with a component of a prototype hypervector Pi, where i = 1, …, c. Devices in the second half of the array are programmed with components of the complementary prototype hypervectors. The exact programming order is determined by the partition factor (f) employed in the coarse-grained randomized partitioning scheme. For f = 10 used in the experiment, devices from the first address up to the (1,000 × c)th address are programmed with the content of the first partition, that is, the first segment of each prototype hypervector. The second set of 1,000 × c addresses is programmed with the content of the second partition, and so on. As the hypervector components are binary, devices mapped to logical 1 components and devices mapped to logical 0 components are programmed to the maximum (~20 μS) and minimum (~0 μS) conductance levels, respectively. The devices are programmed in a single shot (no iterative program-and-verify algorithm is used) with a single reset/set pulse for minimum/maximum conductance devices.
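For illustration, the following sketch groups the AM contents in the programming order described above. The device-level ordering inside each partition block is not specified in the text, so it is left abstract here; the function name and the block representation are ours:

```python
import numpy as np

def am_partition_blocks(prototypes, f=10):
    """Group AM contents by crossbar partition for programming (invHamm case).

    prototypes: binary matrix of shape (c, d), one prototype hypervector per class.
    Returns an ordered list of 2*f blocks: the f partitions of the original
    prototypes (first half of the address range), followed by the f partitions of
    their complements (second half). Logical 1/0 components map to the maximum
    (~20 uS, set) and minimum (~0 uS, reset) conductance levels.
    """
    c, d = prototypes.shape
    seg = d // f                                   # 1,000 components per partition for d = 10,000
    blocks = []
    for block in (prototypes, 1 - prototypes):     # original hypervectors first, then complements
        for p in range(f):
            blocks.append(block[:, p * seg:(p + 1) * seg])   # c x seg sub-matrix of one partition
    return blocks
```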
Once the programming phase is completed, the queries from the testing set of a given task are encoded; only for the experiments in section 'The AM search' is the encoding performed in software. At the end of encoding, the generated n-grams are accumulated and binarized. In the hardware implementation, this step is realized inside the bundler module shown in Fig. 4a. The threshold applied to binarize the sum hypervector components is given by

$$l \times \frac{1}{2^{\,n - \log_2 m}}$$

(equivalently, $l \cdot m / 2^{n}$), where l is the length of the sequence, n is the n-gram size and m is the number of minterms used for the binding operation in the encoder (for example, m = 2 for the 2-minterm encoder).
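A quick consistency check of this threshold (the sequence length used below is an illustrative value, not one taken from the experiments):

```python
from math import log2

def bundling_threshold(l, n, m):
    """Binarization threshold for the sum hypervector: l / 2**(n - log2(m))."""
    return l / 2 ** (n - log2(m))

# 2-minterm encoder with the language-task n-gram size (n = 4, m = 2), illustrative l = 1,000:
print(bundling_threshold(1000, 4, 2))        # 125.0, i.e. l/8
# All-minterm XOR binding has m = 2**(n-1); the familiar majority threshold l/2 is recovered:
print(bundling_threshold(1000, 4, 2 ** 3))   # 500.0
```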

Experiments on the complete in-memory HDC system. For the experiments concerning the complete in-memory HDC system, training with HDC is first performed in software on the language and news datasets. 2-minterm encoding (equation (5)) is used with n-gram sizes of n = 4 and n = 5, respectively. After training is performed, h × d × 2 devices are allocated on the PCM chip for storing the IM and the complementary IM, in addition to d × c devices allocated for the AM. The IM and complementary IM hypervectors are programmed on PCM devices in a single shot with reset/set pulses for logical 0/1 components. The prototype hypervectors of the AM are programmed as described in the section 'Experiments on AM search', with the exception that the complementary prototype hypervectors are not programmed because dotp is used as the similarity metric.

During inference, for every query to be encoded, the IM and complementary IM are read from the prototype PCM chip. In-memory read logic (AND) is performed by thresholding the read current values from the on-chip ADC in software to emulate the sense amplifiers of the eventual proposed hardware at each step of the 2-minterm n-gram encoding process (Supplementary Note 1). The other operations involved in the encoder that are not supported by the prototype PCM chip, such as the 1-bit right-shift permutation, storing of the intermediate results in the minterm buffers, ORing the results of the original and complementary minterm buffers, and the bundling of n-gram hypervectors, are implemented in software. Once the encoding of the query hypervector is completed, the AM search is carried out on that query hypervector as specified in the section 'Experiments on AM search' with dotp as the similarity metric.
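The encoder emulation can be summarized in a short sketch. Equation (5) is not reproduced here, so the exact AND/permutation ordering below is one plausible reading of the steps listed above (in-memory AND, 1-bit right-shift permutation, minterm buffers, OR, bundling); all names are ours and the snippet is illustrative rather than a reproduction of the experimental code:

```python
import numpy as np

def two_minterm_encode(item_hvs, n, threshold):
    """Software emulation of the 2-minterm encoder.

    item_hvs: list of binary item hypervectors read from the IM (one per symbol).
    The in-memory AND is emulated with a bitwise AND; on the chip it is obtained
    by thresholding the read currents with sense amplifiers.
    """
    d = item_hvs[0].shape[0]
    sum_hv = np.zeros(d, dtype=int)
    for i in range(len(item_hvs) - n + 1):
        gram = item_hvs[i:i + n]
        minterm = gram[0].copy()                  # minterm buffer (original hypervectors)
        minterm_c = 1 - gram[0]                   # minterm buffer (complementary hypervectors)
        for hv in gram[1:]:
            minterm = np.roll(minterm, 1)         # 1-bit right-shift permutation
            minterm_c = np.roll(minterm_c, 1)
            minterm &= hv                         # AND with the next item hypervector
            minterm_c &= (1 - hv)                 # AND with its complement
        sum_hv += (minterm | minterm_c)           # OR the two minterm buffers, then bundle
    return (sum_hv > threshold).astype(int)       # binarize with the threshold given above

# Example usage with random item hypervectors (d = 10,000, sequence of 100 symbols):
rng = np.random.default_rng(0)
seq = [rng.integers(0, 2, 10_000) for _ in range(100)]
query = two_minterm_encode(seq, n=4, threshold=100 / 2 ** (4 - 1))
```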
Performance, energy estimation and comparison. To evaluate and benchmark the energy efficiency of the proposed architecture, a cycle-accurate register transfer level (RTL) model of a complete CMOS design that has equivalent throughput to that of the proposed in-memory HDC system architecture has been developed (Supplementary Note 6). A testbench infrastructure is then built to verify the correct behaviour of the model. Once the behaviour is verified, the RTL model is synthesized in a UMC 65 nm technology node using a Synopsys Design Compiler. Owing to the limitations in the electronic design automation (EDA) tools used for synthesizing the CMOS-based HDC, the dimensionality d had to be limited to 2,000. The post-synthesis netlist is then verified using the same stimulus vectors applied during behavioural simulation. During post-synthesis netlist simulation, the design is clocked at a frequency of 440 MHz to create a switching activity file in value change dump (VCD) format for inference of 100 language classification queries. Then, the energy estimation for the CMOS modules is performed by converting the average power values reported by Synopsys PrimeTime, which takes the netlist and the activity file from the previous steps as its inputs. A typical operating condition with a voltage of 1.2 V and a temperature of 25 °C is set as the corner for the energy estimation of the CMOS system. Further energy and area results were obtained for d values of 100, 500 and 1,000 in addition to 2,000. The results were then extrapolated to derive the energy and area estimates for dimensionality d = 10,000 to obtain a fair comparison with the in-memory HDC system.
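The fitting procedure used for this extrapolation is not stated; a minimal sketch, assuming a simple linear fit of the estimate versus d and using placeholder numbers rather than the measured values, is given below:

```python
import numpy as np

def extrapolate_to_d(d_points, values, d_target=10_000):
    """Fit a straight line (estimate versus dimensionality) and evaluate it at d_target.

    A linear fit in d is one plausible choice, since the datapath width of the
    CMOS design grows roughly linearly with the hypervector dimension; the
    actual extrapolation method may differ.
    """
    slope, intercept = np.polyfit(d_points, values, deg=1)
    return slope * d_target + intercept

# Placeholder inputs only; the real figures come from the PrimeTime power reports:
d_points = [100, 500, 1_000, 2_000]
energy_per_query_nJ = [1.0, 4.8, 9.5, 19.0]
print(extrapolate_to_d(d_points, energy_per_query_nJ))   # estimate at d = 10,000
```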
The energy/area of the proposed in-memory HDC system architecture is obtained by adding the energy/area of the modules that are common with the full CMOS design described above, together with the energy of the PCM crossbars and the analog/digital peripheral circuits exclusive to the in-memory HDC architecture. Parameters based on the prototype PCM chip in the 90 nm technology used in the experiments are taken as the basis for the PCM-exclusive energy/area estimation. The parameters of the sense amplifiers, which are not present in the PCM hardware platform but are present in the proposed in-memory HD encoder, are taken from the 65 nm current-latched sense amplifier presented by Chandoke and others53. The area of the current-latched sense amplifier was estimated by scaling the area of the six-transistor SRAM cell in IBM 65 nm technology (0.54 μm²) according to the number of transistors present in the sense amplifier (19). The parameters used for the energy estimation of the PCM crossbars are shown in Extended Data Table 2.
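A worked instance of this scaling, assuming (as the text implies) that the area grows linearly with the transistor count:

```python
# Sense-amplifier area estimate via transistor-count scaling of a 6T SRAM cell.
sram_6t_area_um2 = 0.54                     # six-transistor SRAM cell, IBM 65 nm technology
transistors_in_sa = 19                      # current-latched sense amplifier
area_per_transistor = sram_6t_area_um2 / 6
sa_area_um2 = area_per_transistor * transistors_in_sa
print(sa_area_um2)                          # ~1.71 um^2 per sense amplifier
```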
Data availability
The data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request.

Received: 13 November 2019; Accepted: 7 April 2020; Published online: 1 June 2020

References
1. Kanerva, P. Sparse Distributed Memory (MIT Press, 1988).
2. Kanerva, P. Hyperdimensional computing: an introduction to computing in distributed representation with high-dimensional random vectors. Cogn. Comput. 1, 139–159 (2009).
3. Kanerva, P., Kristoferson, J. & Holst, A. Random indexing of text samples for latent semantic analysis. In Proceedings of the Annual Meeting of the Cognitive Science Society Vol. 22 (Cognitive Science Society, 2000).
4. Rahimi, A., Kanerva, P., Benini, L. & Rabaey, J. M. Efficient biosignal processing using hyperdimensional computing: network templates for combined learning and classification of ExG signals. Proc. IEEE 107, 123–143 (2019).
5. Burrello, A., Cavigelli, L., Schindler, K., Benini, L. & Rahimi, A. Laelaps: an energy-efficient seizure detection algorithm from long-term human iEEG recordings without false alarms. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE) 752–757 (IEEE, 2019).
6. Räsänen, O. J. & Saarinen, J. P. Sequence prediction with sparse distributed hyperdimensional coding applied to the analysis of mobile phone use patterns. IEEE Trans. Neural Netw. Learn. Syst. 27, 1878–1889 (2015).
7. Kleyko, D. & Osipov, E. Brain-like classifier of temporal patterns. In Proceedings of the International Conference on Computer and Information Sciences (ICCOINS) 1–6 (IEEE, 2014).
8. Kleyko, D., Osipov, E., Papakonstantinou, N. & Vyatkin, V. Hyperdimensional computing in industrial systems: the use-case of distributed fault isolation in a power plant. IEEE Access 6, 30766–30777 (2018).
9. Chang, E., Rahimi, A., Benini, L. & Wu, A. A. Hyperdimensional computing-based multimodality emotion recognition with physiological signals. In Proceedings of the IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) 137–141 (IEEE, 2019).
10. Mitrokhin, A., Sutor, P., Fermüller, C. & Aloimonos, Y. Learning sensorimotor control with neuromorphic sensors: toward hyperdimensional active perception. Sci. Robot. 4, eaaw6736 (2019).
11. Montagna, F., Rahimi, A., Benatti, S., Rossi, D. & Benini, L. PULP-HD: accelerating brain-inspired high-dimensional computing on a parallel ultra-low power platform. In Proceedings of the 55th Annual Design Automation Conference DAC 2018, 111:1–111:6 (ACM, 2018).
12. Emruli, B., Gayler, R. W. & Sandin, F. Analogical mapping and inference with binary spatter codes and sparse distributed memory. In Proceedings of the International Joint Conference on Neural Networks (IJCNN) 1–8 (IEEE, 2013).
13. Kleyko, D., Osipov, E., Gayler, R. W., Khan, A. I. & Dyer, A. G. Imitation of honey bees' concept learning processes using vector symbolic architectures. Biol. Inspired Cogn. Architectures 14, 57–72 (2015).
14. Slipchenko, S. V. & Rachkovskij, D. A. Analogical mapping using similarity of binary distributed representations. Inf. Theories Appl. 16, 269–290 (2009).
15. Bandaragoda, T. et al. Trajectory clustering of road traffic in urban environments using incremental machine learning in combination with hyperdimensional computing. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC) 1664–1670 (IEEE, 2019).
16. Osipov, E., Kleyko, D. & Legalov, A. Associative synthesis of finite state automata model of a controlled object with hyperdimensional computing. In Proceedings of the Annual Conference of the IEEE Industrial Electronics Society 3276–3281 (IEEE, 2017).
17. Kleyko, D., Frady, E. P. & Osipov, E. Integer echo state networks: hyperdimensional reservoir computing. Preprint at https://arxiv.org/pdf/1706.00280.pdf (2017).
18. Rahimi, A. et al. High-dimensional computing as a nanoscalable paradigm. IEEE Trans. Circuits Syst. I Regular Papers 64, 2508–2521 (2017).
19. Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. Nat. Nanotechnol. 8, 13–24 (2013).
20. Sebastian, A. et al. Temporal correlation detection using computational phase-change memory. Nat. Commun. 8, 1115 (2017).
21. Zidan, M. A., Strachan, J. P. & Lu, W. D. The future of electronics based on memristive systems. Nat. Electron. 1, 22–29 (2018).
22. Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).
23. Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. https://doi.org/10.1038/s41565-020-0655-z (2020).
24. Li, H. et al. Hyperdimensional computing with 3D VRRAM in-memory kernels: device-architecture co-design for energy-efficient, error-resilient language recognition. In Proceedings of the IEEE International Electron Devices Meeting (IEDM) 16.1.1–16.1.4 (IEEE, 2016).
25. Li, H., Wu, T. F., Mitra, S. & Wong, H. S. P. Device-architecture co-design for hyperdimensional computing with 3D vertical resistive switching random access memory (3D VRRAM). In Proceedings of the International Symposium on VLSI Technology, Systems and Application (VLSI-TSA) 1–2 (IEEE, 2017).
26. Wu, T. F. et al. Brain-inspired computing exploiting carbon nanotube FETs and resistive RAM: hyperdimensional computing case study. In Proceedings of the International Solid State Circuits Conference (ISSCC) 492–494 (IEEE, 2018).
27. Kanerva, P. Binary spatter-coding of ordered k-tuples. In Proceedings of the International Conference on Artificial Neural Networks (ICANN) Vol. 1112, 869–873 (Lecture Notes in Computer Science, Springer, 1996).
28. Joshi, A., Halseth, J. T. & Kanerva, P. Language geometry using random indexing. In Proceedings of the International Symposium on Quantum Interaction 265–274 (Springer, 2016).
29. Chua, L. Resistance switching memories are memristors. Appl. Phys. A 102, 765–783 (2011).
30. Wong, H.-S. P. & Salahuddin, S. Memory leads the way to better computing. Nat. Nanotechnol. 10, 191–194 (2015).
31. Borghetti, J. et al. 'Memristive' switches enable 'stateful' logic operations via material implication. Nature 464, 873–876 (2010).
32. Kvatinsky, S. et al. Magic—memristor-aided logic. IEEE Trans. Circuits Syst. II Express Briefs 61, 895–899 (2014).
33. Shen, W. et al. Stateful logic operations in one-transistor-one-resistor resistive random access memory array. Electron Device Lett. 40, 1538–1541 (2019).
34. Wong, H.-S. P. et al. Phase change memory. Proc. IEEE 98, 2201–2227 (2010).
35. Burr, G. W. et al. Recent progress in phase-change memory technology. IEEE J. Emerging Selected Topics Circuits Syst. 6, 146–162 (2016).
36. Kuzum, D., Jeyasingh, R. G., Lee, B. & Wong, H.-S. P. Nanoelectronic programmable synapses based on phase change materials for brain-inspired computing. Nano Lett. 12, 2179–2186 (2011).
37. Tuma, T., Pantazi, A., Le Gallo, M., Sebastian, A. & Eleftheriou, E. Stochastic phase-change neurons. Nat. Nanotechnol. 11, 693–699 (2016).
38. Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nat. Commun. 9, 2514 (2018).
39. Sebastian, A. et al. Tutorial: brain-inspired computing using phase-change memory devices. J. Appl. Phys. 124, 111101 (2018).
40. Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. https://doi.org/10.1038/s41467-020-16108-9 (2020).
41. Hosseini, P., Sebastian, A., Papandreou, N., Wright, C. D. & Bhaskaran, H. Accumulation-based computing using phase-change memories with FET access devices. Electron Device Lett. 36, 975–977 (2015).
42. Le Gallo, M. et al. Mixed-precision in-memory computing. Nat. Electron. 1, 246–253 (2018).
43. Xiong, F., Liao, A. D., Estrada, D. & Pop, E. Low-power switching of phase-change materials with carbon nanotube electrodes. Science 332, 568–570 (2011).

44. Waser, R. & Aono, M. in Nanoscience and Technology: a Collection of Reviews from Nature Journals 158–165 (World Scientific, 2010).
45. Kent, A. D. & Worledge, D. C. A new spin on magnetic memories. Nat. Nanotechnol. 10, 187–191 (2015).
46. Close, G. et al. Device, circuit and system-level analysis of noise in multi-bit phase-change memory. In Proceedings of the International Electron Devices Meeting (IEDM) 29.5.1–29.5.4 (IEEE, 2010).
47. Breitwisch, M. et al. Novel lithography-independent pore phase change memory. In Proceedings of the Symposium on VLSI Technology 100–101 (IEEE, 2007).
48. Rahimi, A., Kanerva, P. & Rabaey, J. M. A robust and energy-efficient classifier using brain-inspired hyperdimensional computing. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design ISLPED 2016, 64–69 (ACM, 2016).
49. Quasthoff, U., Richter, M. & Biemann, C. Corpus portal for search in monolingual corpora. In Proceedings of the International Conference on Language Resources and Evaluation (LREC) 1799–1802 (ELRA, 2006).
50. Koehn, P. Europarl: a parallel corpus for statistical machine translation. In Proceedings of the MT Summit Vol. 5, 79–86 (AAMT, 2005).
51. Mimaroglu, D. S. Some Text Datasets (Univ. Massachusetts, accessed 9 March 2018); https://www.cs.umb.edu/smimarog/textmining/datasets/
52. Rahimi, A., Benatti, S., Kanerva, P., Benini, L. & Rabaey, J. M. Hyperdimensional biosignal processing: a case study for EMG-based hand gesture recognition. In Proceedings of the 2016 IEEE International Conference on Rebooting Computing (ICRC) 1–8 (IEEE, 2016).
53. Chandoke, N., Chitkara, N. & Grover, A. Comparative analysis of sense amplifiers for SRAM in 65 nm CMOS technology. In Proceedings of the International Conference on Electrical, Computer and Communication Technologies (ICECCT) 1–7 (IEEE, 2015).

Acknowledgements
This work was supported in part by the European Research Council through the European Union's Horizon 2020 Research and Innovation Programme under grant no. 682675 and in part by the European Union's Horizon 2020 Research and Innovation Programme through the project MNEMOSENE under grant no. 780215.

Author contributions
All authors collectively conceived the idea of in-memory hyperdimensional computing. G.K. performed the experiments and analysed the results under the supervision of M.L.G., A.R. and A.S. G.K., M.L.G., A.R. and A.S. wrote the manuscript with input from all authors.

Competing interests
The authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41928-020-0410-3.
Supplementary information is available for this paper at https://doi.org/10.1038/s41928-020-0410-3.
Correspondence and requests for materials should be addressed to A.R. or A.S.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature Limited 2020

Extended Data Table 1 | Architecture configurations and hyperparameters used for the three different tasks

Extended Data Table 2 | Parameters for PCM crossbars energy and area estimation
