GENERIC: Highly Efficient Learning Engine on Edge using Hyperdimensional Computing
Behnam Khaleghi, Jaeyoung Kang, Hanyang Xu, Justin Morris, Tajana Rosing
Department of Computer Science and Engineering, UC San Diego, La Jolla, CA 92093
{bkhaleghi, j5kang, hax032, j1morris, tajana}@ucsd.edu
Abstract

Hyperdimensional Computing (HDC) mimics the brain's basic principles in performing cognitive tasks by encoding the data to high-dimensional vectors and employing non-complex learning techniques. Conventional processing platforms such as CPUs and GPUs are incapable of taking full advantage of the highly parallel bit-level operations of HDC. On the other hand, existing HDC encoding techniques do not cover a broad range of applications to make a custom design plausible. In this paper, we first propose a novel encoding that achieves high accuracy for diverse applications. Thereafter, we leverage the proposed encoding and design a highly efficient and flexible ASIC accelerator, dubbed GENERIC, suited for the edge domain. GENERIC supports both classification (train and inference) and clustering for unsupervised learning on edge. Our design is flexible in the input size (hence it can run various applications) and hypervector dimensionality, allowing it to trade off accuracy and energy/performance on demand. We augment GENERIC with application-opportunistic power gating and voltage over-scaling (thanks to the notable error resiliency of HDC) for further energy reduction. GENERIC encoding improves the prediction accuracy over previous HDC and ML techniques by 3.5% and 6.5%, respectively. At the 14 nm technology node, GENERIC occupies an area of 0.30 mm2 and consumes 0.09 mW static and 1.97 mW active power. Compared to the previous inference-only accelerator, GENERIC reduces the energy consumption by 4.1×.

1 Introduction

Hyperdimensional Computing (HDC) is a novel brain-inspired learning paradigm based on the observation that brains perform cognitive tasks by mapping sensory inputs to a high-dimensional neural representation [1–3]. It enables the brain to carry out simple, low-power, error-resilient, and parallelizable operations all in the hyperspace. Such characteristics of HDC make it appealing for a wide variety of applications, such as the IoT domain, which generates an increasing amount of data under tight resource and energy constraints [4, 5].

HDC uses specific algorithms to encode raw inputs to a high-dimensional representation of hypervectors with Dℎ𝑣 ≈ 2–5K dimensions. The encoding takes place by deterministically associating each element of an input with a binary or bipolar (±1) hypervector and bundling (element-wise addition) the hypervectors of all elements to create the encoded hypervector. Training is straightforward and involves bundling all encoded hypervectors of the same category. For inference, the query input is encoded to a hypervector in the same fashion and compared with all class hypervectors using a simple similarity metric such as cosine.

The bit-level massively parallel operations of HDC do not accord well with conventional CPUs/GPUs due to, e.g., the memory latency and data movement of large vectors and the fact that these devices are over-provisioned for the mostly binary operations of HDC. Previous works on custom HDC accelerators support a limited range of applications or achieve low accuracy. The authors of [6] and [7] propose custom HDC inference designs that are limited to a specific application. More flexible HDC inference ASICs are proposed in [8] and [9], but as we quantify in Section 3.2, the utilized encoding techniques achieve poor accuracy for particular applications such as time-series. The authors of [10] propose a trainable HDC accelerator, which yields 9% lower accuracy than baseline ML algorithms. An HDC-tailored processor is proposed in [11], but it consumes ∼1–2 orders of magnitude more energy than ASIC counterparts. The in-memory HDC platform of [12] uses low-leakage PCM cells to store hypervectors, but its CMOS peripherals throttle the overall efficiency.

In this paper, we propose GENERIC (highly efficient learning engine on edge using hyperdimensional computing) for highly efficient and accurate trainable classification and clustering. Our primary goal is to make GENERIC compact and low-power to meet year-long battery-powered operation, yet fast enough during training and burst inference, e.g., when it serves as an IoT gateway. To this end, we make the following contributions.
(1) We propose a novel HDC encoding that yields high accuracy in various benchmarks. Such a generic encoding is fundamental to develop a custom yet flexible circuit.
(2) We perform a detailed comparison of HDC and various ML techniques on conventional devices and point out the failure of these devices in unleashing HDC advantages.
(3) We propose the GENERIC flexible architecture that implements accurate HDC-based trainable classification and clustering.
(4) GENERIC benefits from extreme energy reduction techniques such as application-opportunistic power gating, on-demand dimension reduction, and error-resilient voltage over-scaling.
(5) Comparison of GENERIC with the state-of-the-art HDC implementations reveals that GENERIC improves the classification accuracy by 3.5% over previous HDC techniques and 6.5% over ML techniques. GENERIC improves energy consumption by 4.1× and 15.7× compared to the previous HDC accelerators [8] and [10], respectively.
Figure 1: (a) HDC model training (initialization), (b) inference, and (c) retraining.
2 Hyperdimensional Computing

2.1 Learning with HDC

Figure 1 demonstrates the HDC training and inference. During training, each input X is encoded to a hypervector H(X) and added up to its class hypervector. In inference, the query is likewise encoded and compared with the class hypervectors. The class index with the highest similarity score is returned as the prediction result. We use the cosine distance of the query and class hypervectors as the similarity metric. The accuracy of an HDC model can be improved by retraining iterations, where the encoded training data are compared with the HDC model and, in case of misprediction, the model is updated by subtracting the encoded hypervector from the mispredicted class and adding it to the correct class.

The similarity of hypervectors indicates their proximity [1], which can be used to cluster data in the hyperspace [13]. Initially, 𝑘 encoded hypervectors are selected as cluster centroids. At each iteration, all encoded inputs are compared with the centroids and added to the closest (highest-score) centroid hypervector. In classification, the model is updated right away. However, in clustering, the model is fixed and used for finding the similarities, and a new model is created from scratch, which replaces the current model in the next iteration.

Some applications generally need extracting local subsequences of consecutive features, without considering the global order of these subsequences (see subsection 3.2). Previous studies use ngram encoding for such datasets [6, 7, 14]. Ngram encoding extracts all subsequences of length 𝑛 (usually 𝑛 ∈ {3–5}) in a given input, encodes all these subsequences, and aggregates them to produce the encoded hypervector. However, ngram encoding achieves very low accuracy for datasets such as images or voices in which the spatio-temporal information should be taken into account.

We propose a new encoding, dubbed GENERIC, to cover a more versatile set of applications. As shown in Figure 2(d), our encoding processes sliding windows of length 𝑛 by applying the permutation encoding. That is, for every window consisting of elements {𝑥𝑘, 𝑥𝑘+1, 𝑥𝑘+2} (for 𝑛=3), three level hypervectors are selected, where ℓ(𝑥𝑘), ℓ(𝑥𝑘+1), and ℓ(𝑥𝑘+2) are permuted by 0, 1, and 2 indexes, respectively. The permuted hypervectors are XORed element-wise to create the window hypervector. The permutation accounts for positional information within a window, e.g., to distinguish “abc” and “bca”. To account for the global order of features, we associate a random but constant id hypervector with each window, which is XORed with the window hypervector to perform binding. To skip the global binding in certain applications, id hypervectors are set to {0}^Dℎ𝑣. Equation (1) formalizes our encoding, where 𝜌^(𝑗) denotes permutation by 𝑗 indexes:

H(X) = Σ𝑘 id𝑘 ⊕ 𝜌^(0)(ℓ(𝑥𝑘)) ⊕ 𝜌^(1)(ℓ(𝑥𝑘+1)) ⊕ ⋯ ⊕ 𝜌^(𝑛−1)(ℓ(𝑥𝑘+𝑛−1))   (1)
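To make the window-based encoding concrete, the following is a minimal NumPy sketch of the flow just described: the level hypervectors of a window are permuted by 0 to 𝑛−1 positions, XORed together, bound with the window's id, and all window hypervectors are bundled into the encoded hypervector. The helper names, the 64-bin quantization, and the assumption that input features are normalized to [0, 1] are illustrative choices, not details of the GENERIC hardware.

```python
import numpy as np

def make_levels(num_levels, d_hv, rng):
    """Random binary level hypervectors, one row per quantization bin."""
    return rng.integers(0, 2, size=(num_levels, d_hv), dtype=np.uint8)

def make_ids(num_windows, d_hv, rng):
    """Per-window id hypervectors; all-zero rows would skip the global binding."""
    return rng.integers(0, 2, size=(num_windows, d_hv), dtype=np.uint8)

def encode(x, levels, ids, n=3, num_levels=64):
    """Encode one input vector x into a D_hv-dimensional hypervector."""
    d_hv = levels.shape[1]
    # Quantize each feature into a level bin (assumes x is scaled to [0, 1]).
    bins = np.minimum((np.asarray(x) * num_levels).astype(int), num_levels - 1)
    encoded = np.zeros(d_hv, dtype=np.int32)
    for k in range(len(bins) - n + 1):          # sliding windows of length n
        window = np.zeros(d_hv, dtype=np.uint8)
        for j in range(n):                      # permute the level HV of x_{k+j} by j indexes
            window ^= np.roll(levels[bins[k + j]], j)
        window ^= ids[k]                        # bind the window with its id
        encoded += 2 * window.astype(np.int32) - 1   # bundle in bipolar (+1/-1) form
    return encoded

# Example with hypothetical sizes:
# rng = np.random.default_rng(0)
# levels, ids = make_levels(64, 4096, rng), make_ids(1024, 4096, rng)
# hv = encode(np.linspace(0, 1, 784), levels, ids, n=3)
```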
Figure 2: (a) Level hypervectors, (b) permutation encoding, (c) random projection encoding, (d) proposed GENERIC encoding.
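Tying the encoding to the learning flow of Figure 1 and Section 2.1, a minimal software model of training, inference, retraining, and clustering might look like the sketch below. It operates on already-encoded hypervectors (for instance, produced by an encoder like the sketch above); the function names, the in-place model updates, and the fixed epoch count are assumptions for illustration only.

```python
import numpy as np

def train(encoded, labels, num_classes):
    """Model initialization: bundle encoded inputs of the same label into a class hypervector."""
    classes = np.zeros((num_classes, encoded.shape[1]), dtype=np.int64)
    for h, y in zip(encoded, labels):
        classes[y] += h
    return classes

def predict(classes, query):
    """Return the class index with the highest cosine similarity to the query."""
    scores = classes @ query / (np.linalg.norm(classes, axis=1) * np.linalg.norm(query) + 1e-12)
    return int(np.argmax(scores))

def retrain_epoch(classes, encoded, labels):
    """On a misprediction, subtract the query from the wrong class and add it to the correct one."""
    for h, y in zip(encoded, labels):
        pred = predict(classes, h)
        if pred != y:
            classes[pred] -= h
            classes[y] += h
    return classes

def cluster(encoded, k, epochs=10):
    """HDC clustering: score against fixed centroids, rebuild a fresh copy every epoch."""
    centroids = encoded[:k].astype(np.int64).copy()    # first k encodings as initial centroids
    for _ in range(epochs):
        new_centroids = np.zeros_like(centroids)
        for h in encoded:
            new_centroids[predict(centroids, h)] += h  # add to the closest centroid's copy
        centroids = new_centroids                      # the copy replaces the model next epoch
    return centroids
```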
Table 1: Accuracy of HDC and ML algorithms.

RP encoding fails in time-series datasets that require temporal information (e.g., EEG). As explained in subsection 3.1, the ngram encoding [6, 14] does not capture the global relation of the features, so it fails in datasets such as image recognition (MNIST). Except for the ngram and the proposed GENERIC, other HDC techniques fail in LANG (text classification) as they enforce capturing sequential information and ignore subsequences.

Figure 3: (a) Energy consumption and (b) execution time of HDC and ML algorithms on different devices.

3.3 Efficiency on Conventional Hardware
HDC's operations are simple and highly parallelizable; however, conventional processors are not optimized for binary operations such as one-bit accumulation. Also, the size of hypervectors in most settings becomes larger than the cache size of low-end edge processors, which may impose significant performance overhead. For a detailed comparison, we implemented the HDC and ML algorithms on the datasets of subsection 3.2 on a Raspberry Pi 3 embedded processor and an NVIDIA Jetson TX2 low-power edge GPU (eGPU), and also on a desktop CPU (Intel Core i7-8700 at 3.2 GHz) with a larger cache. We used a Hioki 3334 power meter to measure the power of the Raspberry Pi.

Figure 3 compares the training and inference (a) energy consumption and (b) execution time of the algorithms, reported as the geometric mean over all benchmarks (for eGPU, we omitted the results of conventional ML as it performed worse than CPU for a variety of libraries we examined). We can observe that (i) conventional ML algorithms, including DNN, unanimously consume less energy than HDC on all devices, (ii) GENERIC encoding, due to processing multiple hypervectors per window, is less efficient than other HDC techniques, and (iii) our eGPU implementation, by data packing (for parallel XOR) and memory reuse, significantly improves the HDC execution time and energy consumption. For instance, eGPU improves the energy usage and execution time of GENERIC inference by 134× and 252× over running on the low-end Raspberry Pi (70× and 30× over CPU). However, GENERIC running on eGPU still consumes 12× (3×) more inference (train) energy, with 27× (111×) higher execution time than the most efficient baseline (random forest). Nonetheless, the eGPU numbers imply substantial energy and runtime reduction potential for HDC by effectively taking advantage of low-precision operations (achieved by bit-packing in eGPU) and high parallelism.

4 GENERIC Architecture

4.1 Overview

Figure 4 shows the main components of the GENERIC architecture. The main inputs include (i) the input port to read an input (including the label in case of training) from the serial interface element by element and store it in the input memory before starting the encoding, (ii) the config port to load the level, 𝑖𝑑, and class hypervectors (in case of offline training), and (iii) the spec port to provide the application characteristics to the controller, such as Dℎ𝑣 dimensionality, 𝑑 elements per input, 𝑛 length of window, 𝑛𝐶 number of classes or centroids, 𝑏𝑤 effective bit-width, and mode (training, inference, or clustering). The output port returns the labels of inference or clustering.

The controller, using the spec data, handles the programmability of GENERIC and orchestrates the operations. For instance, the encoder generates 𝑚=16 (architectural constant) partial dimensions after each iteration over the stored input, where the variable Dℎ𝑣 signals the end of encoding to finalize the search result, 𝑑 denotes the number of input memory rows to be processed to fetch features (i.e., the exit condition for the counter), 𝑛𝐶 indicates the number
of class memory rows that need to be read for the dot-product, and so on. The class memory layout of GENERIC also allows trading off the hypervector length 𝐷ℎ𝑣 and the number of supported classes 𝑛𝐶. By default, GENERIC class memories can store 𝐷ℎ𝑣=4K for up to 𝑛𝐶=32 classes. For an application with fewer than 32 classes, a higher number of dimensions can be used (e.g., 8K dimensions for 16 classes). We further discuss it in subsection 4.3. These application-specific input parameters give GENERIC the flexibility to implement various applications without requiring a complex instruction set or reconfigurable logic.

Figure 4: Overview of GENERIC architecture.

4.2 Classification and Clustering

4.2.1 Encoding and Inference: Features are fetched one by one from the input memory and quantized to obtain the level bin; accordingly, 𝑚 (16) bits of the proper level hypervector are read. The levels are stored as 𝑚-bit rows in the level memory. The stacked registers (reg 𝑛 to 1) facilitate storing and on-the-fly sliding of the level hypervectors of a window. Each pass over the input features generates 𝑚 encoding dimensions, which are used for the dot-product with the classes. The class hypervectors are distributed into 𝑚 memories (CM 1 to CM 𝑚) to enable reading 𝑚 consecutive dimensions at once. The dot-product of the partial encoding with each class is summed up in the pipelined adder (6) and accumulated with the dot-product result of the previous/next 𝑚 dimensions in the score memory (7). After Dℎ𝑣/𝑚 iterations, all dimensions are generated, and the dot-product scores are finalized. We use the cosine similarity metric between the encoding vector H and class C𝑖: 𝛿𝑖 = (H·C𝑖) / (‖H‖₂ × ‖C𝑖‖₂); hence, we need to normalize the dot-product result with the L2 norms. The ‖H‖₂ can be removed from the denominator as it is a constant and does not affect the ranking of the classes. In addition, to eliminate the square root in ‖C𝑖‖₂, we modify the metric to 𝛿𝑖 = (H·C𝑖)² / ‖C𝑖‖₂², without affecting the predictions. The norm2 memory of Figure 4 (8) stores the squared L2 norms of the classes, and similarly, the squared score is passed to the divider (9). We use an approximate log-based division [18].

4.2.2 Training and Retraining: In the first round of training, i.e., model initialization, encoded inputs of the same class/label are accumulated. This is done through the adder (4) and mux (3) of all class memories. The controller uses the input label and the iteration counter to activate the proper memory row. In the next retraining epochs, the model is examined and updated in case of misprediction (see Figure 1). Thus, during retraining, while performing inference on the training data, the encoded hypervector is stored in temporary rows of the class memories (through the second input of mux (3)). If updating a class is required, the class rows are read and latched in the adder (4), followed by reading the corresponding encoded dimensions from the temporary rows and writing the new class dimensions back to the memory. Hence, each update takes 3× Dℎ𝑣/𝑚 cycles. Training also requires calculating the squared L2 norm of the classes in the norm2 memory (8). As can be seen in Figure 4, the class memories are able to pass the output into both ports of the multipliers (one direct and another through the mux) to calculate and then accumulate the squared elements.

4.2.3 Clustering: GENERIC selects the first 𝑘 encoded inputs as the initial cluster centroids and initializes 𝑘 centroids in the class memories. It allocates two sets of memory rows for temporary data: one for the incoming encoding generated in the encoding module and another for the copy centroids (as mentioned in Section 2.1, clustering generates a new copy instead of a direct update). Similarity checking of the encoding dimensions with the centroids is done pipelined, similar to inference, but the encoded dimensions are stored to be added to the copy centroid after finalizing the similarity checking. After finding the most similar centroid, the copy centroid is updated by adding the stored hypervector (similar to retraining). The copy centroids serve as the new centroids in the next epoch.

4.3 Energy Reduction

We take advantage of the properties of the GENERIC architecture and HDC for utmost energy efficiency. The following elaborates the energy-saving techniques that benefit GENERIC. These techniques can also be applied to other HDC accelerators.

4.3.1 id Memory Compression: The 𝑖𝑑 memory naturally needs 1K×4K = 512 KB (for up to 1K features per input and Dℎ𝑣=4K dimensions), which occupies a large area and consumes huge power. However, GENERIC generates 𝑖𝑑s on-the-fly using a seed 𝑖𝑑 vector, where the 𝑘th 𝑖𝑑 is generated by permuting the seed 𝑖𝑑 by 𝑘 indexes. Therefore, the 𝑖𝑑 memory shrinks to 4 Kbit, i.e., a 1024× reduction. Permutation preserves the orthogonality. It is implemented by the tmp register in Figure 4 (2), by which, for a new window, the reg id is right-shifted and one bit of tmp is shifted in. The tmp register helps to avoid frequent access to the 𝑖𝑑 memory by reading 𝑚 (16) bits at once and feeding them in the next 𝑚 cycles.

4.3.2 Application-opportunistic Power Gating: For an application with 𝑛𝐶 classes and using Dℎ𝑣 dimensions, GENERIC stripes the dimensions 1 to 𝑚 (16) of its 1st class vector in the 1st row of the 𝑚 class memories, the 2nd class vector in the 2nd row, and so on (see Figure 4). The next 𝑚 dimensions of the 1st class vector are therefore written into the (𝑛𝐶+1)th row, followed by the other classes. Thus, GENERIC always uses the first (𝑛𝐶 × Dℎ𝑣)/(32 × 4K) portion of the class memories. The applications of Section 3.2, on average, fill 28% of the class memories (minimum 6% for EEG/FACE, and maximum 81% for ISOLET) using Dℎ𝑣=4K dimensions. Accordingly, GENERIC partitions
each class memory into four banks and power gates the unused banks. With four banks, 1.6 out of four banks are activated on average, leading to 59% power saving. With more fine-grained eight banks, 2.7 banks (out of eight) become active, saving 66% power. However, eight banks impose a 55% area overhead compared to 20% for four banks (see Section 5.1 for the setup). We concluded that the four-bank configuration yields the minimum area×power cost. Since the power gating is static (permanent) for an application, no wake-up latency or energy is involved.

Figure 5: Accuracy with constant and updated L2 norm.

Figure 6: Accuracy and power reduction wrt memory error.

Figure 7: Area and power breakdown of GENERIC: (a) area, (b) static power, (c) dynamic power.
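As a back-of-the-envelope illustration of the occupancy argument behind this power gating, the helper below computes the used fraction 𝑛𝐶×Dℎ𝑣/(32×4K) of the class memories and the number of banks that must stay powered under the four-bank split described above. It is a toy model of the sizing reasoning, not of the hardware; the function name and the printed example are hypothetical.

```python
import math

FULL_CLASSES, FULL_DHV = 32, 4096   # default class-memory capacity: 32 classes x 4K dimensions

def active_banks(num_classes, d_hv, banks=4):
    """Return the used fraction of class memory and how many banks must stay on."""
    used = (num_classes * d_hv) / (FULL_CLASSES * FULL_DHV)
    return used, min(banks, math.ceil(used * banks))

# Example: a 16-class application at 2K dimensions fills 25% of the class memory,
# so one bank out of four stays on and the other three can be power-gated.
print(active_banks(16, 2048))   # -> (0.25, 1)
```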
4.3.3 On-demand Dimension Reduction: GENERIC can trade the energy consumption and performance with accuracy. Recall that GENERIC generates 𝑚 dimensions of the encoding per iteration over the features. By feeding a new Dℎ𝑣 value as input, GENERIC can seamlessly use the new dimension count by updating the counter exit condition, so smaller encoding and class hypervectors will be used. Nevertheless, GENERIC stores the squared L2 norms of the whole classes for the similarity metric (𝛿𝑖 = (H·C𝑖)² / ‖C𝑖‖₂²), while for arbitrarily reduced encoding dimensions, only the corresponding elements (and their L2 norms) of the classes are needed. As Figure 5 shows, using the old (Constant) L2 values causes significant accuracy loss compared to using the recomputed (Updated) L2 norms of the sub-hypervectors. The difference is up to 20.1% for EEG and 8.5% for ISOLET. To address this issue, when calculating the squared L2 norms during training, GENERIC stores the L2 norm of every 128-dimension sub-class in a different row of the norm2 memory (8). Thus, dimensions can be reduced with a granularity of 128 while keeping the norm2 memory small (2 KB for 32 classes).

4.3.4 Voltage Over-scaling: GENERIC has to use 16-bit class dimensions to support training. As a result, the large class memories consume ∼80% of the total power. HDC exhibits notable tolerance to the bit-flip of vectors [19], which can be leveraged to over-scale the memory voltage without performance loss. Figure 6 shows the accuracy of select benchmarks (ISOLET and FACE) with respect to the class memory error. The static (s) and dynamic (dyn) power saving as a result of the corresponding voltage scaling (without reducing the clock cycle) is also shown on the right axis (based on the measured data of [20]). The figure shows the results of HDC models with different class bit-widths (the 𝑏𝑤 input parameter of GENERIC), obtained by loading a quantized HDC model (the mask unit (5) in the architecture masks out the unused bits). As can be seen, error tolerance depends not only on the application but also on the bit-width. The 1-bit FACE model shows a high degree of error tolerance (hence, power saving) up to a 7% bit-flip error rate, while ISOLET provides acceptable accuracy up to 4% bit-flip using a 4-bit model. Quantized elements also reduce the dynamic power of the dot-product. Voltage over-scaling also depends on the application's sensitivity to dimension reduction and its workload. For instance, FACE has a higher tolerance to voltage scaling than to dimension reduction (see Figure 5). On the other hand, ISOLET is more sensitive to voltage reduction but achieves good accuracy down to 1K dimensions (Figure 5), which means 4× energy reduction compared to 4K dimensions. Thus, voltage over-scaling for ISOLET is only preferred in workloads with a higher idle time where the static power dominates (voltage scaling reduces the static power more significantly).

5 Results

5.1 Setup

We implemented GENERIC at the RTL level in SystemVerilog and verified the functionality in Modelsim. We used Synopsys Design Compiler to synthesize GENERIC targeting a 500 MHz clock with the 14 nm standard cell library of GlobalFoundries. We used the Artisan memory compiler to generate the SRAM memories. The level memory has a total size of 64×4K = 32 KB for 64 bins, the feature memory is 1024×8b, and the class memories are 8K×16b (16 KB each). We obtained the power consumption using Synopsys Power Compiler. GENERIC occupies an area of 0.30 mm2 and consumes a worst-case static power of 0.25 mW when all memory banks are active. For the datasets of Section 3.2, GENERIC consumes a static and dynamic power of 0.09 mW and 1.79 mW, respectively (without voltage scaling). Figure 7 shows the area and power breakdown. Note that the level memory contributes less than 10% of the area and power. Hence, using more levels does not considerably affect the area or power.

5.2 Classification Evaluation

5.2.1 Training: Since previous HDC ASICs have not reported training energy and performance, we compare the per-input energy and execution time of GENERIC training with RF (random forest, the most efficient baseline) and SVM (the most accurate conventional ML) on CPU, and with DNN and HDC on eGPU. Figure 8 shows the average energy and execution time for the datasets of Section 3.2. GENERIC improves the energy consumption by 528× over RF, 1257× over DNN, and 694× over HDC on eGPU (which, as discussed in Section 3.3, is the most efficient baseline device for HDC). GENERIC consumes an average of 2.06 mW of training power. It also has 11× faster training time than DNN and 3.7× faster than HDC on eGPU. RF has 12× smaller training time than GENERIC, but as we mentioned, the overall energy consumption of GENERIC is significantly (528×) smaller than that of RF. Also, we used a constant 20 epochs for GENERIC training, while the accuracy of most datasets saturates after a few epochs.

5.2.2 Inference: We compare the energy consumption of GENERIC inference with previous HDC platforms from Datta et al. [10] and tiny-HD [8]. We scale their reported numbers to 14 nm according to [21] for a fair comparison. We also include RF (the most efficient ML), SVM (the most accurate ML), and DNN, as well as HDC on eGPU (the most efficient HDC baseline). Figure 9 compares the energy consumption
Acknowledgments

This work was supported in part by CRISP, one of six centers in JUMP (an SRC program sponsored by DARPA), an SRC Global Research Collaboration (GRC) grant, and NSF grants #1911095, #1826967, #2100237, and #2112167. We would like to thank Amin Kalantar and Onat Gungor for helping with the Raspberry Pi experiments.