GENERIC: Highly Efficient Learning Engine on Edge using Hyperdimensional Computing
Behnam Khaleghi, Jaeyoung Kang, Hanyang Xu, Justin Morris, Tajana Rosing
Department of Computer Science and Engineering, UC San Diego, La Jolla, CA 92093
{bkhaleghi, j5kang, hax032, j1morris, tajana}@ucsd.edu
Abstract

Hyperdimensional Computing (HDC) mimics the brain's basic principles in performing cognitive tasks by encoding the data to high-dimensional vectors and employing non-complex learning techniques. Conventional processing platforms such as CPUs and GPUs are incapable of taking full advantage of the highly parallel bit-level operations of HDC. On the other hand, existing HDC encoding techniques do not cover a broad range of applications to make a custom design plausible. In this paper, we first propose a novel encoding that achieves high accuracy for diverse applications. Thereafter, we leverage the proposed encoding and design a highly efficient and flexible ASIC accelerator, dubbed GENERIC, suited for the edge domain. GENERIC supports both classification (train and inference) and clustering for unsupervised learning on edge. Our design is flexible in the input size (hence it can run various applications) and hypervector dimensionality, allowing it to trade off accuracy and energy/performance on demand. We augment GENERIC with application-opportunistic power gating and voltage over-scaling (thanks to the notable error resiliency of HDC) for further energy reduction. GENERIC encoding improves the prediction accuracy over previous HDC and ML techniques by 3.5% and 6.5%, respectively. At the 14 nm technology node, GENERIC occupies an area of 0.30 mm2 and consumes 0.09 mW static and 1.97 mW active power. Compared to the previous inference-only accelerator, GENERIC reduces the energy consumption by 4.1×.

1 Introduction

Hyperdimensional Computing (HDC) is a novel brain-inspired learning paradigm based on the observation that brains perform cognitive tasks by mapping sensory inputs to a high-dimensional neural representation [1–3]. It enables the brain to carry out simple, low-power, error-resilient, and parallelizable operations all in the hyperspace. Such characteristics of HDC make it appealing for a wide variety of applications, such as the IoT domain, which generates an increasing amount of data under tight resource and energy constraints [4, 5].

HDC uses specific algorithms to encode raw inputs to a high-dimensional representation of hypervectors with Dℎ𝑣 ≈ 2–5K dimensions. The encoding takes place by deterministically associating each element of an input with a binary or bipolar (±1) hypervector and bundling (element-wise addition) the hypervectors of all elements to create the encoded hypervector. Training is straightforward and involves bundling all encoded hypervectors of the same category. For inference, the query input is encoded to a hypervector in the same fashion and compared with all class hypervectors using a simple similarity metric such as cosine.

The bit-level massively parallel operations of HDC do not accord well with conventional CPUs/GPUs due to, e.g., the memory latency and data movement of large vectors and the fact that these devices are over-provisioned for the mostly binary operations of HDC. Previous works on custom HDC accelerators support a limited range of applications or achieve low accuracy. The authors of [6] and [7] propose custom HDC inference designs that are limited to a specific application. More flexible HDC inference ASICs are proposed in [8] and [9], but as we quantify in Section 3.2, the utilized encoding techniques achieve poor accuracy for particular applications such as time-series. The authors of [10] propose a trainable HDC accelerator, which yields 9% lower accuracy than baseline ML algorithms. An HDC-tailored processor is proposed in [11], but it consumes ∼1–2 orders of magnitude more energy than ASIC counterparts. The in-memory HDC platform of [12] uses low-leakage PCM cells to store hypervectors, but its CMOS peripherals throttle the overall efficiency.

In this paper, we propose GENERIC (highly efficient learning engine on edge using hyperdimensional computing) for highly efficient and accurate trainable classification and clustering. Our primary goal is to make GENERIC compact and low-power to meet year-long battery-powered operation, yet fast enough during training and burst inference, e.g., when it serves as an IoT gateway. To this end, we make the following contributions.
(1) We propose a novel HDC encoding that yields high accuracy in various benchmarks. Such a generic encoding is fundamental to develop a custom yet flexible circuit.
(2) We perform a detailed comparison of HDC and various ML techniques on conventional devices and point out the failure of these devices in unleashing HDC advantages.
(3) We propose the GENERIC flexible architecture that implements accurate HDC-based trainable classification and clustering.
(4) GENERIC benefits from extreme energy reduction techniques such as application-opportunistic power gating, on-demand dimension reduction, and error-resilient voltage over-scaling.
(5) Comparison of GENERIC with the state-of-the-art HDC implementations reveals that GENERIC improves the classification accuracy by 3.5% over previous HDC techniques and 6.5% over ML techniques. GENERIC improves energy consumption by 4.1× and 15.7× compared to the previous HDC accelerators [8] and [10], respectively.
Figure 1: (a) HDC model training (initialization), (b) inference, and (c) retraining.
2 Hyperdimensional Computing

2.1 Learning with HDC

Figure 1 demonstrates the HDC training and inference. During training, each input X is encoded to a hypervector H(X) and added up to its class hypervector. In inference, the query is likewise encoded and compared with the class hypervectors. The class index with the highest similarity score is returned as the prediction result. We use the cosine distance of the query and class hypervectors as the similarity metric. The accuracy of an HDC model can be improved by retraining iterations, where the encoded training data are compared with the HDC model and, in case of misprediction, the model is updated by subtracting the encoded hypervector from the mispredicted class and adding it to the correct class.

The similarity of hypervectors indicates their proximity [1], which can be used to cluster data in the hyperspace [13]. Initially, 𝑘 encoded hypervectors are selected as cluster centroids. At each iteration, all encoded inputs are compared with the centroids and added to the closest (highest-score) centroid hypervector. In classification, the model is updated right away. However, in clustering, the model is fixed and used for finding the similarities, and a new model is created from scratch, which replaces the current model in the next iteration.

Some applications generally need extracting local subsequences of consecutive features, without considering the global order of these subsequences (see subsection 3.2). Previous studies use ngram encoding for such datasets [6, 7, 14]. Ngram encoding extracts all subsequences of length 𝑛 (usually 𝑛 ∈ {3–5}) in a given input, encodes all these subsequences, and aggregates them to produce the encoded hypervector. However, ngram encoding achieves very low accuracy for datasets such as images or voices in which the spatio-temporal information should be taken into account.

We propose a new encoding, dubbed GENERIC, to cover a more versatile set of applications. As shown in Figure 2(d), our encoding processes sliding windows of length 𝑛 by applying the permutation encoding. That is, for every window consisting of elements {𝑥𝑘, 𝑥𝑘+1, 𝑥𝑘+2} (for 𝑛=3), three level hypervectors are selected, where ℓ(𝑥𝑘), ℓ(𝑥𝑘+1), and ℓ(𝑥𝑘+2) are permuted by 0, 1, and 2 indexes, respectively. The permuted hypervectors are XORed element-wise to create the window hypervector. The permutation accounts for positional information within a window, e.g., to distinguish “abc” and “bca”. To account for the global order of features, we associate a random but constant id hypervector with each window, which is XORed with the window hypervector to perform binding. To skip the global binding in certain applications, id hypervectors are set to {0}^Dℎ𝑣. Equation (1) formalizes our encoding, where 𝜌^(𝑗) denotes permutation by 𝑗 indexes:

H(X) = Σ𝑘 id𝑘 ⊕ 𝜌^(0)(ℓ(𝑥𝑘)) ⊕ 𝜌^(1)(ℓ(𝑥𝑘+1)) ⊕ ⋯ ⊕ 𝜌^(𝑛−1)(ℓ(𝑥𝑘+𝑛−1))   (1)
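To make the window-based encoding concrete, the following is a minimal NumPy sketch of the flow just described: the level hypervectors of a window are permuted by 0 to 𝑛−1 positions, XORed together, bound with the window's id, and all window hypervectors are bundled into the encoded hypervector. The helper names, the 64-bin quantization, and the assumption that input features are normalized to [0, 1] are illustrative choices, not details of the GENERIC hardware.

```python
import numpy as np

def make_levels(num_levels, d_hv, rng):
    """Random binary level hypervectors, one row per quantization bin."""
    return rng.integers(0, 2, size=(num_levels, d_hv), dtype=np.uint8)

def make_ids(num_windows, d_hv, rng):
    """Per-window id hypervectors; all-zero rows would skip the global binding."""
    return rng.integers(0, 2, size=(num_windows, d_hv), dtype=np.uint8)

def encode(x, levels, ids, n=3, num_levels=64):
    """Encode one input vector x into a D_hv-dimensional hypervector."""
    d_hv = levels.shape[1]
    # Quantize each feature into a level bin (assumes x is scaled to [0, 1]).
    bins = np.minimum((np.asarray(x) * num_levels).astype(int), num_levels - 1)
    encoded = np.zeros(d_hv, dtype=np.int32)
    for k in range(len(bins) - n + 1):          # sliding windows of length n
        window = np.zeros(d_hv, dtype=np.uint8)
        for j in range(n):                      # permute the level HV of x_{k+j} by j indexes
            window ^= np.roll(levels[bins[k + j]], j)
        window ^= ids[k]                        # bind the window with its id
        encoded += 2 * window.astype(np.int32) - 1   # bundle in bipolar (+1/-1) form
    return encoded

# Example with hypothetical sizes:
# rng = np.random.default_rng(0)
# levels, ids = make_levels(64, 4096, rng), make_ids(1024, 4096, rng)
# hv = encode(np.linspace(0, 1, 784), levels, ids, n=3)
```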
Figure 2: (a) Level hypervectors, (b) permutation encoding, (c) random projection encoding, (d) proposed GENERIC encoding.
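Tying the encoding to the learning flow of Figure 1 and Section 2.1, a minimal software model of training, inference, retraining, and clustering might look like the sketch below. It operates on already-encoded hypervectors (for instance, produced by an encoder like the sketch above); the function names, the in-place model updates, and the fixed epoch count are assumptions for illustration only.

```python
import numpy as np

def train(encoded, labels, num_classes):
    """Model initialization: bundle encoded inputs of the same label into a class hypervector."""
    classes = np.zeros((num_classes, encoded.shape[1]), dtype=np.int64)
    for h, y in zip(encoded, labels):
        classes[y] += h
    return classes

def predict(classes, query):
    """Return the class index with the highest cosine similarity to the query."""
    scores = classes @ query / (np.linalg.norm(classes, axis=1) * np.linalg.norm(query) + 1e-12)
    return int(np.argmax(scores))

def retrain_epoch(classes, encoded, labels):
    """On a misprediction, subtract the query from the wrong class and add it to the correct one."""
    for h, y in zip(encoded, labels):
        pred = predict(classes, h)
        if pred != y:
            classes[pred] -= h
            classes[y] += h
    return classes

def cluster(encoded, k, epochs=10):
    """HDC clustering: score against fixed centroids, rebuild a fresh copy every epoch."""
    centroids = encoded[:k].astype(np.int64).copy()    # first k encodings as initial centroids
    for _ in range(epochs):
        new_centroids = np.zeros_like(centroids)
        for h in encoded:
            new_centroids[predict(centroids, h)] += h  # add to the closest centroid's copy
        centroids = new_centroids                      # the copy replaces the model next epoch
    return centroids
```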
Table 1: Accuracy of HDC and ML algorithms.

RP encoding fails in time-series datasets that require temporal information (e.g., EEG). As explained in subsection 3.1, the ngram encoding [6, 14] does not capture the global relation of the features, so it fails in datasets such as image recognition (MNIST). Except for the ngram and the proposed GENERIC, other HDC techniques fail in LANG (text classification) as they enforce capturing sequential information and ignore subsequences.

Figure 3: (a) Energy consumption and (b) execution time of HDC and ML algorithms on different devices.

3.3 Efficiency on Conventional Hardware
HDC's operations are simple and highly parallelizable; however, conventional processors are not optimized for binary operations such as one-bit accumulation. Also, the size of hypervectors in most settings becomes larger than the cache size of low-end edge processors, which may impose significant performance overhead. For a detailed comparison, we implemented the HDC and ML algorithms on the datasets of subsection 3.2 on a Raspberry Pi 3 embedded processor and an NVIDIA Jetson TX2 low-power edge GPU (eGPU), and also on a desktop CPU (Intel Core i7-8700 at 3.2 GHz) with a larger cache. We used a Hioki 3334 power meter to measure the power of the Raspberry Pi.

Figure 3 compares the training and inference (a) energy consumption and (b) execution time of the algorithms, reported as the geometric mean over all benchmarks (for eGPU, we omitted the results of conventional ML as it performed worse than CPU for a variety of libraries we examined). We can observe that (i) conventional ML algorithms, including DNN, unanimously consume less energy than HDC on all devices, (ii) GENERIC encoding, due to processing multiple hypervectors per window, is less efficient than other HDC techniques, and (iii) our eGPU implementation, by data packing (for parallel XOR) and memory reuse, significantly improves the HDC execution time and energy consumption. For instance, eGPU improves the energy usage and execution time of GENERIC inference by 134× and 252× over running on the low-end Raspberry Pi (70× and 30× over CPU). However, GENERIC running on eGPU still consumes 12× (3×) more inference (train) energy, with 27× (111×) higher execution time than the most efficient baseline (random forest). Nonetheless, the eGPU numbers imply substantial energy and runtime reduction potential for HDC by effectively taking advantage of low-precision operations (achieved by bit-packing in eGPU) and high parallelism.

4 GENERIC Architecture

4.1 Overview

Figure 4 shows the main components of the GENERIC architecture. The main inputs include (i) the input port to read an input (including the label in case of training) from the serial interface element by element and store it in the input memory before starting the encoding, (ii) the config port to load the level, 𝑖𝑑, and class hypervectors (in case of offline training), and (iii) the spec port to provide the application characteristics to the controller, such as Dℎ𝑣 dimensionality, 𝑑 elements per input, 𝑛 length of window, 𝑛𝐶 number of classes or centroids, 𝑏𝑤 effective bit-width, and mode (training, inference, or clustering). The output port returns the labels of inference or clustering.

The controller, using the spec data, handles the programmability of GENERIC and orchestrates the operations. For instance, the encoder generates 𝑚=16 (architectural constant) partial dimensions after each iteration over the stored input, where the variable Dℎ𝑣 signals the end of encoding to finalize the search result, 𝑑 denotes the number of input memory rows to be processed to fetch features (i.e., the exit condition for the counter), 𝑛𝐶 indicates the number
of class memory rows that need to be read for the dot-product, and so on. The class memory layout of GENERIC also allows trading off the hypervector length 𝐷ℎ𝑣 and the number of supported classes 𝑛𝐶. By default, GENERIC class memories can store 𝐷ℎ𝑣=4K for up to 𝑛𝐶=32 classes. For an application with fewer than 32 classes, a higher number of dimensions can be used (e.g., 8K dimensions for 16 classes). We further discuss it in subsection 4.3. These application-specific input parameters give GENERIC the flexibility to implement various applications without requiring a complex instruction set or reconfigurable logic.

Figure 4: Overview of GENERIC architecture.

4.2 Classification and Clustering

4.2.1 Encoding and Inference: Features are fetched one by one from the input memory and quantized to obtain the level bin; accordingly, 𝑚 (16) bits of the proper level hypervector are read. The levels are stored as 𝑚-bit rows in the level memory. The stacked registers (reg 𝑛 to 1) facilitate storing and on-the-fly sliding of the level hypervectors of a window. Each pass over the input features generates 𝑚 encoding dimensions, which are used for the dot-product with the classes. The class hypervectors are distributed into 𝑚 memories (CM 1 to CM 𝑚) to enable reading 𝑚 consecutive dimensions at once. The dot-product of the partial encoding with each class is summed up in the pipelined adder (6) and accumulated with the dot-product result of the previous/next 𝑚 dimensions in the score memory (7). After Dℎ𝑣/𝑚 iterations, all dimensions are generated, and the dot-product scores are finalized. We use the cosine similarity metric between the encoding vector H and class C𝑖: 𝛿𝑖 = (H·C𝑖) / (‖H‖₂ × ‖C𝑖‖₂); hence, we need to normalize the dot-product result with the L2 norms. The ‖H‖₂ can be removed from the denominator as it is a constant and does not affect the ranking of the classes. In addition, to eliminate the square root in ‖C𝑖‖₂, we modify the metric to 𝛿𝑖 = (H·C𝑖)² / ‖C𝑖‖₂², without affecting the predictions. The norm2 memory of Figure 4 (8) stores the squared L2 norms of the classes, and similarly, the squared score is passed to the divider (9). We use an approximate log-based division [18].

4.2.2 Training and Retraining: In the first round of training, i.e., model initialization, encoded inputs of the same class/label are accumulated. This is done through the adder (4) and mux (3) of all class memories. The controller uses the input label and the iteration counter to activate the proper memory row. In the next retraining epochs, the model is examined and updated in case of misprediction (see Figure 1). Thus, during retraining, while performing inference on the training data, the encoded hypervector is stored in temporary rows of the class memories (through the second input of mux (3)). If updating a class is required, the class rows are read and latched in the adder (4), followed by reading the corresponding encoded dimensions from the temporary rows and writing the new class dimensions back to the memory. Hence, each update takes 3× Dℎ𝑣/𝑚 cycles. Training also requires calculating the squared L2 norm of the classes in the norm2 memory (8). As can be seen in Figure 4, the class memories are able to pass the output into both ports of the multipliers (one direct and another through the mux) to calculate and then accumulate the squared elements.

4.2.3 Clustering: GENERIC selects the first 𝑘 encoded inputs as the initial cluster centroids and initializes 𝑘 centroids in the class memories. It allocates two sets of memory rows for temporary data: one for the incoming encoding generated in the encoding module and another for the copy centroids (as mentioned in Section 2.1, clustering generates a new copy instead of a direct update). Similarity checking of the encoding dimensions with the centroids is done pipelined, similar to inference, but the encoded dimensions are stored to be added to the copy centroid after finalizing the similarity checking. After finding the most similar centroid, the copy centroid is updated by adding the stored hypervector (similar to retraining). The copy centroids serve as the new centroids in the next epoch.

4.3 Energy Reduction

We take advantage of the properties of the GENERIC architecture and HDC for utmost energy efficiency. The following elaborates the energy-saving techniques that benefit GENERIC. These techniques can also be applied to other HDC accelerators.

4.3.1 id Memory Compression: The 𝑖𝑑 memory naturally needs 1K×4K = 512 KB (for up to 1K features per input and Dℎ𝑣=4K dimensions), which occupies a large area and consumes huge power. However, GENERIC generates 𝑖𝑑s on-the-fly using a seed 𝑖𝑑 vector, where the 𝑘th 𝑖𝑑 is generated by permuting the seed 𝑖𝑑 by 𝑘 indexes. Therefore, the 𝑖𝑑 memory shrinks to 4 Kbit, i.e., a 1024× reduction. Permutation preserves the orthogonality. It is implemented by the tmp register in Figure 4 (2), by which, for a new window, the reg id is right-shifted and one bit of tmp is shifted in. The tmp register helps to avoid frequent access to the 𝑖𝑑 memory by reading 𝑚 (16) bits at once and feeding them in the next 𝑚 cycles.

4.3.2 Application-opportunistic Power Gating: For an application with 𝑛𝐶 classes and using Dℎ𝑣 dimensions, GENERIC stripes the dimensions 1 to 𝑚 (16) of its 1st class vector in the 1st row of the 𝑚 class memories, the 2nd class vector in the 2nd row, and so on (see Figure 4). The next 𝑚 dimensions of the 1st class vector are therefore written into the (𝑛𝐶+1)th row, followed by the other classes. Thus, GENERIC always uses the first (𝑛𝐶 × Dℎ𝑣)/(32 × 4K) portion of the class memories. The applications of Section 3.2, on average, fill 28% of the class memories (minimum 6% for EEG/FACE, and maximum 81% for ISOLET) using Dℎ𝑣=4K dimensions. Accordingly, GENERIC partitions
each class memory into four banks and power gates the unused banks. With four banks, 1.6 out of four banks are activated on average, leading to 59% power saving. With more fine-grained eight banks, 2.7 banks (out of eight) become active, saving 66% power. However, eight banks impose a 55% area overhead compared to 20% for four banks (see Section 5.1 for the setup). We concluded that the four-bank configuration yields the minimum area×power cost. Since the power gating is static (permanent) for an application, no wake-up latency or energy is involved.

Figure 5: Accuracy with constant and updated L2 norm.

Figure 6: Accuracy and power reduction wrt memory error.

Figure 7: Area and power breakdown of GENERIC: (a) area, (b) static power, (c) dynamic power.
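As a back-of-the-envelope illustration of the occupancy argument behind this power gating, the helper below computes the used fraction 𝑛𝐶×Dℎ𝑣/(32×4K) of the class memories and the number of banks that must stay powered under the four-bank split described above. It is a toy model of the sizing reasoning, not of the hardware; the function name and the printed example are hypothetical.

```python
import math

FULL_CLASSES, FULL_DHV = 32, 4096   # default class-memory capacity: 32 classes x 4K dimensions

def active_banks(num_classes, d_hv, banks=4):
    """Return the used fraction of class memory and how many banks must stay on."""
    used = (num_classes * d_hv) / (FULL_CLASSES * FULL_DHV)
    return used, min(banks, math.ceil(used * banks))

# Example: a 16-class application at 2K dimensions fills 25% of the class memory,
# so one bank out of four stays on and the other three can be power-gated.
print(active_banks(16, 2048))   # -> (0.25, 1)
```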
4.3.3 On-demand Dimension Reduction: GENERIC can trade the energy consumption and performance with accuracy. Recall that GENERIC generates 𝑚 dimensions of the encoding per iteration over the features. By feeding a new Dℎ𝑣 value as input, GENERIC can seamlessly use the new dimension count by updating the counter exit condition, so smaller encoding and class hypervectors will be used. Nevertheless, GENERIC stores the squared L2 norms of the whole classes for the similarity metric (𝛿𝑖 = (H·C𝑖)² / ‖C𝑖‖₂²), while for arbitrarily reduced encoding dimensions, only the corresponding elements (and their L2 norms) of the classes are needed. As Figure 5 shows, using the old (Constant) L2 values causes significant accuracy loss compared to using the recomputed (Updated) L2 norms of the sub-hypervectors. The difference is up to 20.1% for EEG and 8.5% for ISOLET. To address this issue, when calculating the squared L2 norms during training, GENERIC stores the L2 norm of every 128-dimension sub-class in a different row of the norm2 memory (8). Thus, dimensions can be reduced with a granularity of 128 while keeping the norm2 memory small (2 KB for 32 classes).

4.3.4 Voltage Over-scaling: GENERIC has to use 16-bit class dimensions to support training. As a result, the large class memories consume ∼80% of the total power. HDC exhibits notable tolerance to the bit-flip of vectors [19], which can be leveraged to over-scale the memory voltage without performance loss. Figure 6 shows the accuracy of select benchmarks (ISOLET and FACE) with respect to the class memory error. The static (s) and dynamic (dyn) power saving as a result of the corresponding voltage scaling (without reducing the clock cycle) is also shown on the right axis (based on the measured data of [20]). The figure shows the results of HDC models with different class bit-widths (the 𝑏𝑤 input parameter of GENERIC), obtained by loading a quantized HDC model (the mask unit (5) in the architecture masks out the unused bits). As can be seen, error tolerance depends not only on the application but also on the bit-width. The 1-bit FACE model shows a high degree of error tolerance (hence, power saving) up to a 7% bit-flip error rate, while ISOLET provides acceptable accuracy up to 4% bit-flip using a 4-bit model. Quantized elements also reduce the dynamic power of the dot-product. Voltage over-scaling also depends on the application's sensitivity to dimension reduction and its workload. For instance, FACE has a higher tolerance to voltage scaling than to dimension reduction (see Figure 5). On the other hand, ISOLET is more sensitive to voltage reduction but achieves good accuracy down to 1K dimensions (Figure 5), which means 4× energy reduction compared to 4K dimensions. Thus, voltage over-scaling for ISOLET is only preferred in workloads with a higher idle time where the static power dominates (voltage scaling reduces the static power more significantly).

5 Results

5.1 Setup

We implemented GENERIC at the RTL level in SystemVerilog and verified the functionality in Modelsim. We used Synopsys Design Compiler to synthesize GENERIC targeting a 500 MHz clock with the 14 nm standard cell library of GlobalFoundries. We used the Artisan memory compiler to generate the SRAM memories. The level memory has a total size of 64×4K = 32 KB for 64 bins, the feature memory is 1024×8b, and the class memories are 8K×16b (16 KB each). We obtained the power consumption using Synopsys Power Compiler. GENERIC occupies an area of 0.30 mm2 and consumes a worst-case static power of 0.25 mW when all memory banks are active. For the datasets of Section 3.2, GENERIC consumes a static and dynamic power of 0.09 mW and 1.79 mW, respectively (without voltage scaling). Figure 7 shows the area and power breakdown. Note that the level memory contributes less than 10% of the area and power. Hence, using more levels does not considerably affect the area or power.

5.2 Classification Evaluation

5.2.1 Training: Since previous HDC ASICs have not reported training energy and performance, we compare the per-input energy and execution time of GENERIC training with RF (random forest, the most efficient baseline) and SVM (the most accurate conventional ML) on CPU, and with DNN and HDC on eGPU. Figure 8 shows the average energy and execution time for the datasets of Section 3.2. GENERIC improves the energy consumption by 528× over RF, 1257× over DNN, and 694× over HDC on eGPU (which, as discussed in Section 3.3, is the most efficient baseline device for HDC). GENERIC consumes an average of 2.06 mW of training power. It also has 11× faster training time than DNN and 3.7× faster than HDC on eGPU. RF has 12× smaller training time than GENERIC, but as we mentioned, the overall energy consumption of GENERIC is significantly (528×) smaller than that of RF. Also, we used a constant 20 epochs for GENERIC training, while the accuracy of most datasets saturates after a few epochs.

5.2.2 Inference: We compare the energy consumption of GENERIC inference with previous HDC platforms from Datta et al. [10] and tiny-HD [8]. We scale their reported numbers to 14 nm according to [21] for a fair comparison. We also include RF (the most efficient ML), SVM (the most accurate ML), and DNN, as well as HDC on eGPU (the most efficient HDC baseline). Figure 9 compares the energy consumption
Acknowledgments

This work was supported in part by CRISP, one of six centers in JUMP (an SRC program sponsored by DARPA), an SRC Global Research Collaboration (GRC) grant, and NSF grants #1911095, #1826967, #2100237, and #2112167. We would like to thank Amin Kalantar and Onat Gungor for helping with the Raspberry Pi experiments.