
GENERIC: Highly Efficient Learning Engine on Edge using Hyperdimensional Computing
Behnam Khaleghi, Jaeyoung Kang, Hanyang Xu, Justin Morris, Tajana Rosing
Department of Computer Science and Engineering, UC San Diego, La Jolla, CA 92093
{bkhaleghi, j5kang, hax032, j1morris, tajana}@ucsd.edu
Abstract

Hyperdimensional Computing (HDC) mimics the brain's basic principles in performing cognitive tasks by encoding data to high-dimensional vectors and employing non-complex learning techniques. Conventional processing platforms such as CPUs and GPUs are incapable of taking full advantage of the highly parallel bit-level operations of HDC. On the other hand, existing HDC encoding techniques do not cover a broad range of applications to make a custom design plausible. In this paper, we first propose a novel encoding that achieves high accuracy for diverse applications. Thereafter, we leverage the proposed encoding and design a highly efficient and flexible ASIC accelerator, dubbed GENERIC, suited for the edge domain. GENERIC supports both classification (training and inference) and clustering for unsupervised learning on edge. Our design is flexible in the input size (hence it can run various applications) and hypervector dimensionality, allowing it to trade off accuracy and energy/performance on demand. We augment GENERIC with application-opportunistic power gating and voltage over-scaling (thanks to the notable error resiliency of HDC) for further energy reduction. GENERIC encoding improves the prediction accuracy over previous HDC and ML techniques by 3.5% and 6.5%, respectively. At the 14 nm technology node, GENERIC occupies an area of 0.30 mm2 and consumes 0.09 mW static and 1.97 mW active power. Compared to the previous inference-only accelerator, GENERIC reduces the energy consumption by 4.1x.

1 Introduction

Hyperdimensional Computing (HDC) is a novel brain-inspired learning paradigm based on the observation that brains perform cognitive tasks by mapping sensory inputs to high-dimensional neural representations [1-3]. It enables the brain to carry out simple, low-power, error-resilient, and parallelizable operations all in the hyperspace. Such characteristics of HDC make it appealing for a wide variety of applications such as the IoT domain, which generates an increasing amount of data with tight resource and energy constraints [4, 5].

HDC uses specific algorithms to encode raw inputs to a high-dimensional representation of hypervectors with D_hv of roughly 2-5K dimensions. The encoding takes place by deterministically associating each element of an input with a binary or bipolar (+/-1) hypervector and bundling (element-wise addition) the hypervectors of all elements to create the encoded hypervector. Training is straightforward and involves bundling all encoded hypervectors of the same category. For inference, the query input is encoded to a hypervector in the same fashion and compared with all class hypervectors using a simple similarity metric such as cosine.

The bit-level massively parallel operations of HDC do not accord well with conventional CPUs/GPUs due to, e.g., the memory latency and data movement of large vectors and the fact that these devices are over-provisioned for the mostly binary operations of HDC. Previous works on custom HDC accelerators support a limited range of applications or achieve low accuracy. The authors of [6] and [7] propose custom HDC inference designs that are limited to a specific application. More flexible HDC inference ASICs are proposed in [8] and [9], but as we quantify in Section 3.2, the utilized encoding techniques achieve poor accuracy for particular applications such as time-series. The authors of [10] propose a trainable HDC accelerator, which yields 9% lower accuracy than baseline ML algorithms. An HDC-tailored processor is proposed in [11], but it consumes ~1-2 orders of magnitude more energy than ASIC counterparts. The in-memory HDC platform of [12] uses low-leakage PCM cells to store hypervectors, but its CMOS peripherals throttle the overall efficiency.

In this paper, we propose GENERIC (highly efficient learning engine on edge using hyperdimensional computing) for highly efficient and accurate trainable classification and clustering. Our primary goal is to make GENERIC compact and low-power to meet year-long battery-powered operation, yet fast enough during training and burst inference, e.g., when it serves as an IoT gateway. To this end, we make the following contributions.
(1) We propose a novel HDC encoding that yields high accuracy on various benchmarks. Such a generic encoding is fundamental to developing a custom yet flexible circuit.
(2) We perform a detailed comparison of HDC and various ML techniques on conventional devices and point out the failure of these devices to unleash the advantages of HDC.
(3) We propose the GENERIC flexible architecture, which implements accurate HDC-based trainable classification and clustering.
(4) GENERIC benefits from extreme energy reduction techniques such as application-opportunistic power gating, on-demand dimension reduction, and error-resilient voltage over-scaling.
(5) Comparison of GENERIC with state-of-the-art HDC implementations reveals that GENERIC improves classification accuracy by 3.5% over previous HDC techniques and 6.5% over ML techniques. GENERIC improves energy consumption by 4.1x and 15.7x compared to the previous HDC accelerators [8] and [10], respectively.

Figure 1: (a) HDC model training (initialization), (b) inference, and (c) retraining.
2 Hyperdimensional Computing

2.1 Learning with HDC

Figure 1 demonstrates HDC training and inference. During training, each input X is encoded to a hypervector H(X) and added up to its class hypervector. In inference, the query is likewise encoded and compared with the class hypervectors. The class index with the highest similarity score is returned as the prediction result. We use the cosine distance of the query and class hypervectors as the similarity metric. The accuracy of an HDC model can be improved by retraining iterations, where the encoded training data are compared with the HDC model and, in case of misprediction, the model is updated by subtracting the encoded hypervector from the mispredicted class and adding it to the correct class.

The similarity of hypervectors indicates their proximity [1], which can be used to cluster data in the hyperspace [13]. Initially, k encoded hypervectors are selected as cluster centroids. At each iteration, all encoded inputs are compared with the centroids and added to the closest (highest-score) centroid hypervector. In classification, the model is updated right away. However, in clustering, the model is fixed and used for finding the similarities, and a new model is created from scratch, which replaces the current model in the next iteration.
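As a concrete illustration of the flow in Figure 1, the following minimal NumPy sketch (our own simplification, not the GENERIC hardware datapath; the encode function is a placeholder for any of the encodings of Sections 2.2 and 3.1) initializes class hypervectors by bundling, performs cosine-similarity inference, and applies the retraining update.

```python
import numpy as np

def train_init(encode, X_train, y_train, n_classes, d_hv=4096):
    # Model initialization: bundle (add) the encoded hypervectors of each class.
    classes = np.zeros((n_classes, d_hv))
    for x, y in zip(X_train, y_train):
        classes[y] += encode(x)
    return classes

def predict(encode, classes, x):
    # Inference: cosine similarity between the encoded query and every class.
    h = encode(x)
    scores = classes @ h / (np.linalg.norm(classes, axis=1) * np.linalg.norm(h) + 1e-12)
    return int(np.argmax(scores)), h

def retrain_epoch(encode, classes, X_train, y_train):
    # Retraining: on a misprediction, subtract the query from the wrong class
    # and add it to the correct class (Figure 1(c)).
    for x, y in zip(X_train, y_train):
        pred, h = predict(encode, classes, x)
        if pred != y:
            classes[pred] -= h
            classes[y] += h
    return classes
```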
2.2 Encoding

Encoding is the major step of HDC; hence, previous works have proposed several encoding techniques to map the inputs to high-dimensional space. Most encodings associate hypervectors with the raw input features (elements), called level hypervectors (see Figure 2(a)), which are hyperspace representatives of scalar elements. Usually, inputs are quantized into bins to limit the number of levels. If there is a meaningful distance between the input elements (as in the values of white and black pixels), this distance is also preserved when generating the levels.

Encoding of an input is accomplished by aggregating the level hypervectors of its elements. To handle the positional order of elements, which is essential in most datasets such as image or voice, HDC uses variants of binding. The permutation encoding of Figure 2(b) carries out binding by circular shift of the level hypervectors; the level hypervector of the m-th feature is permuted by m indexes. Some other encodings such as random projection (RP), shown in Figure 2(c), or level-id use id hypervectors for binding. In these encodings, each input index has a random (but constant) binary id, which is multiplied (XOR in the binary domain) with its level, and the result vector is aggregated with that of the other indexes.
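One common way to generate such distance-preserving level hypervectors (a standard HDC practice, not a procedure spelled out by the paper) is to start from a random hypervector for the lowest bin and flip a disjoint chunk of positions for each subsequent bin, so that nearby bins stay similar while the extreme bins become nearly orthogonal. A minimal sketch:

```python
import numpy as np

def make_level_hypervectors(n_levels=64, d_hv=4096, seed=0):
    rng = np.random.default_rng(seed)
    levels = np.empty((n_levels, d_hv), dtype=int)
    levels[0] = rng.choice([-1, 1], size=d_hv)
    # Flip disjoint chunks of indices so consecutive bins differ in a small,
    # fixed number of positions and the extreme bins differ in ~d_hv/2.
    order = rng.permutation(d_hv)
    chunk = d_hv // (2 * (n_levels - 1))
    for i in range(1, n_levels):
        levels[i] = levels[i - 1]
        idx = order[(i - 1) * chunk : i * chunk]
        levels[i, idx] = -levels[i, idx]
    return levels

levels = make_level_hypervectors()
# Adjacent bins remain highly similar; the extreme bins are nearly orthogonal.
print(int(levels[0] @ levels[1]), int(levels[0] @ levels[63]))
```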
3 Proposed HDC Encoding

3.1 GENERIC Encoding

The encoding techniques discussed in Section 2.2 achieve low accuracy for certain datasets, such as language identification, which generally need extracting local subsequences of consecutive features without considering the global order of these subsequences (see subsection 3.2). Previous studies use ngram encoding for such datasets [6, 7, 14]. Ngram encoding extracts all subsequences of length n (usually n in {3-5}) in a given input, encodes all these subsequences, and aggregates them to produce the encoded hypervector. However, ngram encoding achieves very low accuracy for datasets such as images or voices, in which the spatio-temporal information should be taken into account.

We propose a new encoding, dubbed GENERIC, to cover a more versatile set of applications. As shown in Figure 2(d), our encoding processes sliding windows of length n by applying the permutation encoding. That is, for every window consisting of elements {x_k, x_{k+1}, x_{k+2}} (for n=3), three level hypervectors are selected, where l(x_k), l(x_{k+1}), and l(x_{k+2}) are permuted by 0, 1, and 2 indexes, respectively. The permuted hypervectors are XORed element-wise to create the window hypervector. The permutation accounts for positional information within a window, e.g., to distinguish "abc" and "bca". To account for the global order of features, we associate a random but constant id hypervector with each window, which is XORed with the window hypervector to perform binding. To skip the global binding in certain applications, the id hypervectors are set to all zeros. Equation (1) formalizes our encoding, where \rho^{(j)} indicates permutation by j indexes, \bigotimes multiplies (XOR in binary) the levels of the i-th window, id_i applies the binding id, and the sum adds up the window hypervectors for all windows of the d input elements:

H(X) = \sum_{i=1}^{d-n+1} id_i \cdot \bigotimes_{j=0}^{n-1} \rho^{(j)} \ell(x_{i+j})    (1)

We use n=3 as it achieved the highest accuracy (on average) for our examined benchmarks (see subsection 3.2); however, the GENERIC architecture can adjust the value of n for every application.
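The following NumPy sketch is our own reading of Equation (1), using bipolar (+/-1) hypervectors so that XOR becomes element-wise multiplication; the level and id tables are assumed to be pre-generated as in Section 2.2.

```python
import numpy as np

def generic_encode(x, levels, ids, bins, n=3):
    """Encode a 1-D feature vector x per Equation (1).

    levels: (n_levels, d_hv) bipolar level hypervectors
    ids:    (n_windows, d_hv) bipolar id hypervectors (use all-ones rows, the
            bipolar counterpart of the all-zero id, to skip global binding)
    bins:   quantization bin edges for the raw feature values
    """
    d_hv = levels.shape[1]
    q = np.digitize(x, bins)                    # feature value -> level index
    q = np.clip(q, 0, levels.shape[0] - 1)
    h = np.zeros(d_hv, dtype=np.int32)
    for i in range(len(x) - n + 1):             # sliding windows of length n
        window = np.ones(d_hv, dtype=np.int32)
        for j in range(n):
            # permute (circularly shift) the level of the j-th element by j indexes
            window *= np.roll(levels[q[i + j]], j)
        h += ids[i] * window                    # bind with the window id, then bundle
    return h
```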
3.2 Accuracy Comparison

We compiled eleven datasets from different domains, consisting of the benchmarks described in [10], seizure detection by skull-surface EEG signals, and user activity recognition by motion sensors (PAMAP2) [15]. We implemented the HDC algorithms using an optimized Python implementation that leverages SIMD operations. For the ML techniques, we used the Python scikit-learn library [16]. We discarded the results of logistic regression and k-nearest neighbors as they achieved lower accuracy. For the DNN models of the benchmarks, we used the AutoKeras library [17] for automated model exploration.

Table 1 summarizes the accuracy results (RP: random projection, MLP: multi-layer perceptron, SVM: support vector machine, RF: random forest). The proposed GENERIC encoding achieves 3.5% higher accuracy than the best baseline HDC (level-id), 6.5% higher than the best baseline ML (SVM), and 1.0% higher than DNN.

Figure 2: (a) Level hypervectors, (b) permutation encoding, (c) random projection encoding, (d) proposed GENERIC encoding.
Table 1: Accuracy of HDC and ML algorithms.

             HDC Algorithms                                  ML Algorithms
Dataset    RP      level-id  ngram    permute  GENERIC    MLP     SVM     RF      DNN
CARDIO     83.0%   88.1%     88.1%    88.2%    91.8%      86.4%   86.4%   95.3%   90.1%
DNA        99.3%   99.3%     99.7%    99.3%    99.7%      99.5%   99.5%   99.5%   99.8%
EEG        46.8%   77.5%     83.1%    78.3%    83.1%      56.8%   75.4%   80.1%   60.2%
EMG        53.6%   90.9%     90.8%    91.1%    90.9%      91.0%   89.2%   83.6%   89.4%
FACE       95.3%   95.0%     73.3%    96.1%    95.7%      95.5%   97.3%   92.5%   96.7%
ISOLET     93.2%   93.5%     38.9%    93.5%    93.1%      95.0%   96.0%   92.2%   94.4%
LANG       8.2%    75.9%     100.0%   52.8%    100.0%     5.4%    30.8%   10.3%   99.9%
MNIST      94.6%   89.4%     53.0%    89.3%    94.0%      96.7%   97.9%   96.0%   99.1%
PAGE       96.1%   91.6%     91.7%    91.7%    91.8%      96.5%   96.9%   97.4%   95.8%
PAMAP2     83.0%   94.6%     60.9%    95.8%    93.8%      92.9%   91.9%   95.6%   96.1%
UCIHAR     93.4%   94.6%     64.9%    94.7%    94.9%      94.6%   95.8%   95.6%   96.5%
Mean       77.0%   90.0%     76.8%    88.3%    93.5%      82.8%   87.0%   85.3%   92.5%
STDV       27.5%   6.9%      19.2%    12.4%    4.4%       26.9%   19.0%   24.4%   10.8%
1E+1
The RP encoding fails in time-series datasets that require temporal information (e.g., EEG). As explained in subsection 3.1, the ngram encoding [6, 14] does not capture the global relation of the features, so it fails in datasets such as speech (ISOLET) and image recognition (MNIST). Except for the ngram and the proposed GENERIC, other HDC techniques fail in LANG (text classification) as they enforce capturing sequential information and ignore subsequences.

Figure 3: (a) Energy consumption and (b) execution time of HDC and ML algorithms on different devices.

3.3 Efficiency on Conventional Hardware
HDC's operations are simple and highly parallelizable; however, conventional processors are not optimized for binary operations such as one-bit accumulation. Also, the size of hypervectors in most settings becomes larger than the cache size of low-end edge processors, which may impose a significant performance overhead. For a detailed comparison, we implemented the HDC and ML algorithms on the datasets of subsection 3.2 on a Raspberry Pi 3 embedded processor and an NVIDIA Jetson TX2 low-power edge GPU (eGPU), and also a desktop CPU (Intel Core i7-8700 at 3.2 GHz) with a larger cache. We used a Hioki 3334 power meter to measure the power of the Raspberry Pi.

Figure 3 compares the training and inference (a) energy consumption and (b) execution time of the algorithms, reported as the geometric mean over all benchmarks (for eGPU, we omitted the results of conventional ML as it performed worse than on the CPU for a variety of libraries we examined). We can observe that (i) conventional ML algorithms, including DNN, unanimously consume less energy than HDC on all devices, (ii) GENERIC encoding, due to processing multiple hypervectors per window, is less efficient than other HDC techniques, and (iii) our eGPU implementation, by data packing (for parallel XOR) and memory reuse, significantly improves the HDC execution time and energy consumption. For instance, the eGPU improves the energy usage and execution time of GENERIC inference by 134x and 252x over running on the low-end Raspberry Pi (70x and 30x over the CPU). However, GENERIC running on the eGPU still consumes 12x (3x) more inference (train) energy, with 27x (111x) higher execution time, than the most efficient baseline (random forest). Nonetheless, the eGPU numbers imply substantial energy and runtime reduction potential for HDC by effectively taking advantage of low-precision operations (achieved by bit-packing on the eGPU) and high parallelism.
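As an illustration of the bit-packing idea (our own NumPy sketch of the general technique, not the paper's eGPU implementation), binary hypervectors can be packed 8 dimensions per byte so that a single XOR word operation processes many dimensions at once, and Hamming distance reduces to a popcount:

```python
import numpy as np

def pack(hv_bits):
    # hv_bits: (d_hv,) array of 0/1 values -> packed uint8 array, 8 dims per byte
    return np.packbits(hv_bits.astype(np.uint8))

def hamming_packed(a_packed, b_packed):
    # XOR the packed words, then count set bits (popcount) to get the
    # number of differing dimensions.
    diff = np.bitwise_xor(a_packed, b_packed)
    return int(np.unpackbits(diff).sum())

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 4096)
b = rng.integers(0, 2, 4096)
print(hamming_packed(pack(a), pack(b)))   # ~2048 for random hypervectors
```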
4 GENERIC Architecture

4.1 Overview

Figure 4: Overview of GENERIC architecture.

Figure 4 shows the main components of the GENERIC architecture. The main inputs include (i) the input port, to read an input (including the label in case of training) from the serial interface element by element and store it in the input memory before starting the encoding, (ii) the config port, to load the level, id, and class hypervectors (in case of offline training), and (iii) the spec port, to provide the application characteristics to the controller, such as the D_hv dimensionality, d elements per input, n length of window, n_C number of classes or centroids, b_w effective bit-width, and mode (training, inference, or clustering). The output port returns the labels of inference or clustering.

The controller, using the spec data, handles the programmability of GENERIC and orchestrates the operations. For instance, the encoder generates m=16 (an architectural constant) partial dimensions after each iteration over the stored input, where the variable D_hv signals the end of encoding to finalize the search result, d denotes the number of input memory rows to be processed to fetch features (i.e., the exit condition for the counter), n_C indicates the number of class memory rows that need to be read for the dot-product, and so on. The class memory layout of GENERIC also allows trading off the hypervector length D_hv against the number of supported classes n_C. By default, GENERIC class memories can store D_hv=4K dimensions for up to n_C=32 classes. For an application with fewer than 32 classes, a higher number of dimensions can be used (e.g., 8K dimensions for 16 classes). We further discuss this in subsection 4.3. These application-specific input parameters give GENERIC the flexibility to implement various applications without requiring a complex instruction set or reconfigurable logic.

4.2 Classification and Clustering

4.2.1 Encoding and Inference: Features are fetched one by one from the input memory and quantized to obtain the level bin; accordingly, m (16) bits of the proper level hypervector are read. The levels are stored as m-bit rows in the level memory. The stacked registers (reg n to 1) facilitate storing and on-the-fly sliding of the level hypervectors of a window. Each pass over the input features generates m encoding dimensions, which are used for the dot-product with the classes. The class hypervectors are distributed into m memories (CM 1 to CM m) to enable reading m consecutive dimensions at once. The dot-product of the partial encoding with each class is summed up in the pipelined adder (6) and accumulated with the dot-product result of the previous/next m dimensions in the score memory (7). After D_hv/m iterations, all dimensions are generated and the dot-product scores are finalized. We use the cosine similarity metric between the encoding vector H and class C_i, $\delta_i = \frac{H \cdot C_i}{\|H\|_2 \times \|C_i\|_2}$; hence, we need to normalize the dot-product result with the L2 norms. The $\|H\|_2$ term can be removed from the denominator as it is a constant and does not affect the ranking of the classes. In addition, to eliminate the square root of $\|C_i\|_2$, we modify the metric to $\delta_i = \frac{(H \cdot C_i)^2}{\|C_i\|_2^2}$ without affecting the predictions. The norm2 memory of Figure 4 (8) stores the squared L2 norms of the classes, and similarly, the squared score is passed to the divider (9). We use an approximate log-based division [18].
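To see why dropping the constant norm of H and squaring preserve the predicted class, the small sketch below (a NumPy illustration we added, not the hardware datapath) compares the ranking produced by the full cosine similarity against the hardware-friendly squared metric.

```python
import numpy as np

rng = np.random.default_rng(1)
d_hv, n_classes = 4096, 8
H = rng.integers(0, 50, d_hv).astype(float)               # encoded query (non-negative counts)
C = rng.integers(0, 50, (n_classes, d_hv)).astype(float)  # accumulated class hypervectors

dots = C @ H
cosine = dots / (np.linalg.norm(H) * np.linalg.norm(C, axis=1))
# Hardware metric: drop the constant ||H||, avoid the square root by squaring.
# Note: squaring assumes non-negative dot-products, which holds for the
# count-based query and class hypervectors used in this toy example.
hw_metric = dots ** 2 / (np.linalg.norm(C, axis=1) ** 2)

assert np.argmax(cosine) == np.argmax(hw_metric)
print(np.argmax(cosine), np.argmax(hw_metric))  # same predicted class
```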
4.2.2 Training and Retraining: In the first round of training, i.e., model initialization, the encoded inputs of the same class/label are accumulated. This is done through the adder (4) and mux (3) of all class memories. The controller uses the input label and the iteration counter to activate the proper memory row. In the next retraining epochs, the model is examined and updated in case of misprediction (see Figure 1). Thus, during retraining, while performing inference on the training data, the encoded hypervector is stored in temporary rows of the class memories (through the second input of mux (3)). If updating a class is required, the class rows are read and latched in the adder (4), followed by reading the corresponding encoded dimensions from the temporary rows and writing the new class dimensions back to the memory. Hence, each update takes 3 x D_hv/m cycles. Training also requires calculating the squared L2 norm of the classes in the norm2 memory (8). As can be seen in Figure 4, the class memories are able to pass their output to both ports of the multipliers (one directly and another through the mux) to calculate and then accumulate the squared elements.
4.2.3 Clustering: GENERIC selects the first k encoded inputs as the initial cluster centroids and initializes k centroids in the class memories. It allocates two sets of memory rows for temporary data: one for the incoming encoding generated in the encoding module and another for the copy centroids (as mentioned in Section 2.1, clustering generates a new copy instead of updating the model directly). Similarity checking of the encoding dimensions with the centroids is pipelined similarly to inference, but the encoded dimensions are stored to be added to the copy centroid after finalizing the similarity checking. After finding the most similar centroid, the copy centroid is updated by adding the stored hypervector (similar to retraining). The copy centroids serve as the new centroids in the next epoch.

4.3 Energy Reduction

We take advantage of the properties of the GENERIC architecture and HDC for utmost energy efficiency. The following elaborates the energy-saving techniques that benefit GENERIC. These techniques can also be applied to other HDC accelerators.

4.3.1 id Memory Compression: The id memory naturally needs 1K x 4K = 512 KB (for up to 1K features per input and D_hv=4K dimensions), which occupies a large area and consumes huge power. However, GENERIC generates the ids on-the-fly using a seed id vector, where the k-th id is generated by permuting the seed id by k indexes. Therefore, the id memory shrinks to 4 Kbit, i.e., a 1024x reduction. Permutation preserves the orthogonality. It is implemented by the tmp register in Figure 4 (2), by which, for a new window, the reg id is right-shifted and one bit of tmp is shifted in. The tmp register helps to avoid frequent access to the id memory by reading m (16) bits at once and feeding them in the next m cycles.
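A functional model of this id compression (our own sketch; the hardware uses a shift register rather than np.roll) derives every window id from a single stored seed by circular permutation, which keeps the ids mutually pseudo-orthogonal:

```python
import numpy as np

rng = np.random.default_rng(2)
d_hv = 4096
seed_id = rng.integers(0, 2, d_hv, dtype=np.uint8)   # the only id actually stored (4 Kbit)

def window_id(k):
    # id_k = seed id circularly permuted by k positions, generated on the fly.
    return np.roll(seed_id, k)

# Distinct ids remain close to orthogonal: roughly 50% of bits agree.
agree = np.mean(window_id(0) == window_id(17))
print(f"agreement between id_0 and id_17: {agree:.2f}")
```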
4.3.2 Application-opportunistic Power Gating: For an application with n_C classes and D_hv dimensions, GENERIC stripes dimensions 1 to m (16) of the 1st class vector into the 1st row of the m class memories, the 2nd class vector into the 2nd row, and so on (see Figure 4). The next m dimensions of the 1st class vector are therefore written into the (n_C+1)-th row, followed by the other classes. Thus, GENERIC always uses the first n_C x D_hv / (32 x 4K) portion of the class memories. The applications of Section 3.2, on average, fill 28% of the class memories (minimum 6% for EEG/FACE, and maximum 81% for ISOLET) using D_hv=4K dimensions. Accordingly, GENERIC partitions each class memory into four banks and power gates the unused banks. With four banks, 1.6 out of four banks are activated on average, leading to 59% power saving. With more fine-grained eight banks, 2.7 banks (out of eight) become active, saving 66% power. However, eight banks impose a 55% area overhead compared to 20% for four banks (see Section 5.1 for the setup). We concluded that the four-bank configuration yields the minimum area x power cost. Since the power gating is static (permanent) for an application, no wake-up latency or energy is involved.
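As a worked example of this occupancy arithmetic (our own sketch; the default capacities and the ceil-based bank count are our assumptions built on the numbers stated above):

```python
import math

def class_memory_usage(n_classes, d_hv, max_classes=32, max_dhv=4096, n_banks=4):
    """Fraction of the class memories an application fills, and how many of
    the equally sized banks must stay powered (the rest can be power gated)."""
    fill = (n_classes * d_hv) / (max_classes * max_dhv)
    active_banks = max(1, math.ceil(fill * n_banks))
    return fill, active_banks

# A 26-class application at 4K dimensions (e.g., ISOLET's 26 spoken letters)
# fills ~81% of the memories and keeps all 4 banks active, while a 2-class
# application at 4K dimensions fills ~6% and needs only 1 of the 4 banks.
print(class_memory_usage(26, 4096))
print(class_memory_usage(2, 4096))
```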

‘•–ƒ– ’†ƒ–‡† ‘•–ƒ– ’†ƒ–‡† ͺ„ Ͷ„ ʹ„ ͺ„ Ͷ„ ʹ„


 ‘•–ƒ–  ’†ƒ–‡†  ‘•–ƒ–  ’†ƒ–‡† ͳ„ ’‘™‡”ሺ•ሻ ’‘™‡”ሺ†›ሻ ͳ„ ’‘™‡”ሺ•ሻ ’‘™‡”ሺ†›ሻ
ͻ͸Ψ ͳͲͲΨ ͳͲͲΨ ͹ൈ ͳͲͲΨ ͹ൈ
͸ൈ

‘™‡””‡†— –‹‘

‘™‡””‡†— –‹‘
ͻʹΨ ͻͲΨ ͻʹΨ ͸ൈ ͻʹΨ

—”ƒ ›
ͷൈ ͷൈ
 —”ƒ ›

 —”ƒ ›

—”ƒ ›
ͺͺΨ ͺͲΨ ͺͶΨ ͺͶΨ
Ͷൈ Ͷൈ
ͺͶΨ ͹ͲΨ ͹͸Ψ ͹͸Ψ


͵ൈ ͵ൈ


ͺͲΨ ͸ͲΨ ͸ͺΨ  ʹൈ ͸ͺΨ  ʹൈ
͹͸Ψ ͷͲΨ ͸ͲΨ ͳൈ ͸ͲΨ ͳൈ
Ͳ ͲǤͷ ͳ ͳǤͷ ʹ ʹǤͷ ͵ ͵Ǥͷ Ͷ Ͳ ͲǤͷ ͳ ͳǤͷ ʹ ʹǤͷ ͵ ͵Ǥͷ Ͷ ͲΨ ʹΨ ͶΨ ͸Ψ ͺΨ ͳͲΨ ͲΨ ʹΨ ͶΨ ͸Ψ ͺΨ ͳͲΨ
‹‡•‹‘•ሺൈ
ൈͳͲͲͲሻሻ ‹‡•‹‘•ሺൈ
ൈͳͲͲͲሻሻ ‹–””‘”
ƒ–‡ ‹–””‘”
ƒ–‡

Figure 5: Accuracy with constant and updated L2 norm. Figure 6: Accuracy and power reduction wrt memory error.
4.3.3 On-demand Dimension Reduction: GENERIC can trade energy consumption and performance for accuracy. Recall that GENERIC generates m dimensions of the encoding per iteration over the features. By feeding a new D_hv value as input, GENERIC can seamlessly use the new dimension count by updating the counter exit condition, so smaller encoding and class hypervectors will be used. Nevertheless, GENERIC stores the squared L2 norms of the whole classes for the similarity metric ($\delta_i = \frac{(H \cdot C_i)^2}{\|C_i\|_2^2}$), while for arbitrarily reduced encoding dimensions, only the corresponding elements (and their L2 norms) of the classes are needed. As Figure 5 shows, using the old (Constant) L2 values causes significant accuracy loss compared to using the recomputed (Updated) L2 norms of the sub-hypervectors. The difference is up to 20.1% for EEG and 8.5% for ISOLET. To address this issue, when calculating the squared L2 norms during training, GENERIC stores the L2 norm of every 128-dimension sub-class in a different row of the norm2 memory (8). Thus, dimensions can be reduced with a granularity of 128 while keeping the norm2 memory small (2 KB for 32 classes).
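The sketch below (our own functional model of this bookkeeping, with hypothetical names) stores one squared-norm partial sum per 128-dimension segment of each class at training time, so that when the model is later truncated to any multiple of 128 dimensions the correct squared norm is just a prefix sum:

```python
import numpy as np

SEG = 128  # norm granularity: one squared-norm entry per 128 dimensions

def segment_norms(classes):
    # classes: (n_classes, d_hv) -> (n_classes, d_hv // SEG) squared-norm partial sums
    n_classes, d_hv = classes.shape
    segs = classes.reshape(n_classes, d_hv // SEG, SEG)
    return (segs.astype(np.int64) ** 2).sum(axis=2)

def norm2_for_dims(seg_norms, d_used):
    # Squared L2 norm of each class restricted to the first d_used dimensions
    # (d_used must be a multiple of SEG), recovered without touching the model.
    return seg_norms[:, : d_used // SEG].sum(axis=1)

rng = np.random.default_rng(3)
classes = rng.integers(-200, 200, (8, 4096))
sn = segment_norms(classes)
# Matches a direct recomputation on the truncated class hypervectors.
assert np.array_equal(norm2_for_dims(sn, 1024),
                      (classes[:, :1024].astype(np.int64) ** 2).sum(axis=1))
print(norm2_for_dims(sn, 1024)[:3])
```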
4.3.4 Voltage Over-scaling: GENERIC has to use 16-bit class dimensions to support training. As a result, the large class memories consume ~80% of the total power. HDC exhibits notable tolerance to bit-flips in its vectors [19], which can be leveraged to over-scale the memory voltage without performance loss. Figure 6 shows the accuracy of select benchmarks (ISOLET and FACE) with respect to the class memory error rate. The static (s) and dynamic (dyn) power saving resulting from the corresponding voltage scaling (without reducing the clock cycle) is also shown on the right axis (based on the measured data of [20]). The figure shows the results of HDC models with different bit-widths (the b_w input parameter of GENERIC) of the classes, obtained by loading a quantized HDC model (the mask unit (5) in the architecture masks out the unused bits). As can be seen, error tolerance depends not only on the application but also on the bit-width. The 1-bit FACE model shows a high degree of error tolerance (hence, power saving) up to a 7% bit-flip error rate, while ISOLET provides acceptable accuracy up to a 4% bit-flip rate using a 4-bit model. Quantized elements also reduce the dynamic power of the dot-product. Voltage over-scaling also depends on the application's sensitivity to dimension reduction and its workload. For instance, FACE has a higher tolerance to voltage scaling than to dimension reduction (see Figure 5). On the other hand, ISOLET is more sensitive to voltage reduction but achieves good accuracy down to 1K dimensions (Figure 5), which means 4x energy reduction compared to 4K dimensions. Thus, voltage over-scaling for ISOLET is only preferred in workloads with higher idle time, where the static power dominates (voltage scaling reduces the static power more significantly).
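One way to estimate this resilience in software (a simulation sketch we added; it emulates random class-memory bit-flips on an unsigned quantized model, not the actual SRAM behavior) is to flip each stored bit independently with the target error rate and then re-run inference with the perturbed model:

```python
import numpy as np

def inject_bit_flips(classes_q, bit_width, error_rate, rng):
    """Flip each bit of a quantized class model independently with
    probability error_rate, emulating voltage over-scaled class memories."""
    flipped = classes_q.astype(np.int64)
    for b in range(bit_width):
        mask = rng.random(flipped.shape) < error_rate
        flipped ^= (mask.astype(np.int64) << b)
    return flipped

rng = np.random.default_rng(4)
classes_q = rng.integers(0, 2 ** 4, (8, 4096))   # e.g., a 4-bit quantized model
noisy = inject_bit_flips(classes_q, bit_width=4, error_rate=0.04, rng=rng)
print(np.mean(noisy != classes_q))  # fraction of class elements affected
```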
Figure 7: Area and power breakdown of GENERIC.

5 Results

5.1 Setup

We implemented GENERIC at the RTL level in SystemVerilog and verified the functionality in ModelSim. We used Synopsys Design Compiler to synthesize GENERIC targeting a 500 MHz clock with the 14 nm standard cell library of GlobalFoundries. We used the Artisan memory compiler to generate the SRAM memories. The level memory has a total size of 64 x 4K = 32 KB for 64 bins, the feature memory is 1024 x 8b, and the class memories are 8K x 16b (16 KB each). We obtained the power consumption using Synopsys Power Compiler. GENERIC occupies an area of 0.30 mm2 and consumes a worst-case static power of 0.25 mW when all memory banks are active. For the datasets of Section 3.2, GENERIC consumes a static and dynamic power of 0.09 mW and 1.79 mW, respectively (without voltage scaling). Figure 7 shows the area and power breakdown. Note that the level memory contributes less than 10% of the area and power; hence, using more levels does not considerably affect the area or power.

5.2 Classification Evaluation

5.2.1 Training: Since previous HDC ASICs have not reported training energy and performance, we compare the per-input energy and execution time of GENERIC training with RF (random forest, the most efficient baseline) and SVM (the most accurate conventional ML) on CPU, and with DNN and HDC on eGPU. Figure 8 shows the average energy and execution time for the datasets of Section 3.2. GENERIC improves the energy consumption by 528x over RF, 1257x over DNN, and 694x over HDC on eGPU (which, as discussed in Section 3.3, is the most efficient baseline device for HDC). GENERIC consumes an average of 2.06 mW of training power. It also has 11x faster training time than DNN and 3.7x faster than HDC on eGPU. RF has 12x smaller training time than GENERIC, but as mentioned, the overall energy consumption of GENERIC is significantly (528x) smaller than that of RF. Also, we used a constant 20 epochs for GENERIC training, while the accuracy of most datasets saturates after a few epochs.

Figure 8: Training energy and execution time.

5.2.2 Inference: We compare the energy consumption of GENERIC inference with the previous HDC platforms of Datta et al. [10] and tiny-HD [8]. We scale their reported numbers to 14 nm according to [21] for a fair comparison. We also include RF (the most efficient ML), SVM (the most accurate ML), and DNN, as well as HDC on eGPU (the most efficient HDC baseline). Figure 9 compares the energy consumption of GENERIC and the aforementioned baselines. Since GENERIC achieves significantly higher accuracy than previous work (e.g., 10.3% over [10]), GENERIC-LP applies the low-power techniques of Section 4.3 to leverage this accuracy benefit. GENERIC-LP improves the baseline GENERIC energy by 15.5x through dimension reduction and voltage over-scaling. GENERIC-LP consumes 15.7x and 4.1x less energy compared to [10] and tiny-HD [8], respectively. Note that, unlike tiny-HD [8], GENERIC supports training, which makes it use larger memories. GENERIC is 1593x and 8796x more energy-efficient than the most efficient ML (RF) and eGPU-HDC, respectively.

Figure 9: Inference energy of GENERIC vs baselines.


5.3 Clustering Evaluation

Table 2: Mutual information score of K-means and HDC.

           Hepta   Tetra   TwoDiamonds   WingNut   Iris
K-means    1.0     0.637   1.0           0.774     0.758
HDC        0.904   0.589   0.981         0.781     0.760

Table 2 compares the normalized mutual information scores of K-means and HDC clustering for the FCPS [22] benchmarks and the Iris flower dataset. On average, K-means achieves a slightly (0.031) higher score, but for datasets with more features, the proposed GENERIC can better benefit from using windows (windows become less effective with a smaller number of features).

Figure 10 compares the per-input energy consumption of GENERIC with K-means clustering running on the CPU and the Raspberry Pi. GENERIC consumes only 0.068 uJ per input, which is 17,523x and 61,400x more efficient than K-means on the Raspberry Pi and the CPU, respectively. The average per-input execution time of the Raspberry Pi and the CPU is, respectively, 394 us and 248 us, while GENERIC achieves 9.6 us (41x and 26x faster than the R-Pi and CPU, respectively).

Figure 10: GENERIC and K-means energy comparison.
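For reference, the clustering quality metric reported here can be computed with scikit-learn, which the paper already uses for its ML baselines; the label arrays below are placeholders:

```python
from sklearn.metrics import normalized_mutual_info_score

# true_labels: ground-truth classes of the FCPS/Iris samples
# hdc_labels:  cluster assignments produced by the HDC clustering
true_labels = [0, 0, 1, 1, 2, 2]
hdc_labels = [1, 1, 0, 0, 2, 2]
# NMI is invariant to label permutation, so a perfect (relabeled) match scores 1.0.
print(normalized_mutual_info_score(true_labels, hdc_labels))
```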
6 Conclusion

We proposed GENERIC, a highly efficient HDC accelerator that supports classification (inference and training) and clustering using a novel encoding technique that achieves 3.5% (6.5%) better accuracy compared to other HDC (ML) algorithms. GENERIC benefits from power gating, voltage over-scaling, and dimension reduction for utmost energy saving. Our results showed that GENERIC improves the classification energy by 15.7x over a previous trainable HDC accelerator and 4.1x over an inference-only accelerator. GENERIC HDC-based clustering consumes 17,523x lower energy with 41x higher performance than a Raspberry Pi running K-means with similar accuracy, facilitating ultra-efficient continuous learning on edge.

Acknowledgements

This work was supported in part by CRISP, one of six centers in JUMP (an SRC program sponsored by DARPA), the SRC Global Research Collaboration (GRC) grant, and NSF grants #1911095, #1826967, #2100237, and #2112167. We would like to thank Amin Kalantar and Onat Gungor for helping with the Raspberry Pi experiments.

References

[1] A. Thomas, S. Dasgupta, and T. Rosing, "Theoretical foundations of hyperdimensional computing," Journal of Artificial Intelligence Research, vol. 72, pp. 215-249, 2021.
[2] L. Ge and K. K. Parhi, "Classification using hyperdimensional computing: A review," IEEE Circuits and Systems Magazine, vol. 20, no. 2, pp. 30-47, 2020.
[3] P. Kanerva, "Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors," Cognitive Computation, vol. 1, no. 2, pp. 139-159, 2009.
[4] R. Aitken, V. Chandra, J. Myers, B. Sandhu, L. Shifren, and G. Yeric, "Device and technology implications of the internet of things," in 2014 Symposium on VLSI Technology: Digest of Technical Papers, pp. 1-4, IEEE, 2014.
[5] X. Yu, X. Song, L. Cherkasova, and T. S. Rosing, "Reliability-driven deployment in energy-harvesting sensor networks," in 2020 16th International Conference on Network and Service Management (CNSM), pp. 1-9, IEEE, 2020.
[6] A. Rahimi, P. Kanerva, et al., "A robust and energy-efficient classifier using brain-inspired hyperdimensional computing," in International Symposium on Low Power Electronics and Design, pp. 64-69, 2016.
[7] A. Moin, A. Zhou, A. Rahimi, A. Menon, S. Benatti, G. Alexandrov, S. Tamakloe, et al., "A wearable biosensing system with in-sensor adaptive machine learning for hand gesture recognition," Nature Electronics, vol. 4, no. 1, pp. 54-63, 2021.
[8] B. Khaleghi, H. Xu, J. Morris, and T. S. Rosing, "tiny-HD: Ultra-efficient hyperdimensional computing engine for IoT applications," in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 408-413, IEEE, 2021.
[9] M. Eggimann, A. Rahimi, and L. Benini, "A 5 uW standard cell memory-based configurable hyperdimensional computing accelerator for always-on smart sensing," arXiv preprint arXiv:2102.02758, 2021.
[10] S. Datta et al., "A programmable hyper-dimensional processor architecture for human-centric IoT," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 3, pp. 439-452, 2019.
[11] F. Montagna, A. Rahimi, S. Benatti, D. Rossi, and L. Benini, "PULP-HD: Accelerating brain-inspired high-dimensional computing on a parallel ultra-low power platform," in 55th Design Automation Conference (DAC), pp. 1-6, IEEE, 2018.
[12] G. Karunaratne, M. Le Gallo, G. Cherubini, L. Benini, et al., "In-memory hyperdimensional computing," Nature Electronics, pp. 1-11, 2020.
[13] M. Imani, Y. Kim, et al., "HDCluster: An accurate clustering using brain-inspired high-dimensional computing," in Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1591-1594, IEEE, 2019.
[14] P. Alonso et al., "HyperEmbed: Tradeoffs between resources and performance in NLP tasks with hyperdimensional computing enabled embedding of n-gram statistics," in International Joint Conference on Neural Networks, IEEE, 2021.
[15] "UCI machine learning repository." https://archive.ics.uci.edu/ml/datasets/.
[16] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[17] H. Jin, Q. Song, and X. Hu, "Auto-Keras: An efficient neural architecture search system," in 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1946-1956, 2019.
[18] J. N. Mitchell, "Computer multiplication and division using binary logarithms," IRE Transactions on Electronic Computers, no. 4, pp. 512-517, 1962.
[19] M. Imani, A. Rahimi, D. Kong, T. Rosing, et al., "Exploring hyperdimensional associative memory," in International Symposium on High Performance Computer Architecture (HPCA), pp. 445-456, IEEE, 2017.
[20] L. Yang and B. Murmann, "SRAM voltage scaling for energy-efficient convolutional neural networks," in International Symposium on Quality Electronic Design (ISQED), pp. 7-12, IEEE, 2017.
[21] A. Stillmaker and B. Baas, "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm," Integration, vol. 58, pp. 74-81, 2017.
[22] A. Ultsch, "Clustering with SOM: U*C," in Proceedings of the Workshop on Self-Organizing Maps, 2005.
lar accuracy, facilitating ultra-efficient continuous learning on edge.
