Approximation- and Quantization-Aware Training for Graph Neural Networks

Abstract—Graph Neural Networks (GNNs) are one of the best-performing models for processing graph data. They are known to have considerable computational complexity, despite the smaller number of parameters compared to traditional Deep Neural Networks (DNNs). The operations-to-parameters ratio of GNNs can be tens to hundreds of times higher than that of DNNs, depending on the input graph size. This complexity indicates the importance of arithmetic operation optimization within GNNs through model quantization and approximation. In this work, for the first time, we combine both approaches and implement quantization- and approximation-aware training for GNNs to sustain their accuracy under the errors induced by inexact multiplications. We employ a matrix multiplication CUDA kernel to speed up the simulation of approximate multiplication within GNNs. Further, we demonstrate the execution speed, accuracy, and energy efficiency of GNNs with approximate multipliers in comparison with quantized low-bit GNNs. We evaluate the performance of state-of-the-art GNN architectures (i.e., GIN, SAGE, GCN, and GAT) on various datasets and tasks (i.e., Reddit-Binary and Collab for graph classification, Cora and PubMed for node classification) with a wide range of approximate multipliers. Our framework is available online: https://github.com/TUM-AIPro/AxC-GNN.

Index Terms—Graph neural network, approximate computing, quantization, deep learning.

I. INTRODUCTION

[...] data, and these algorithms need to offer high expressive power in the modeling of interactions between different objects. Nevertheless, the flexibility of graph representations challenges the capabilities of DNNs, boosting the demand for architectures supporting graph-represented data. Graph Neural Networks (GNNs) have drawn ample attention since they can generalize the ability of DNNs to learn deep representations to graphs and their elements. This has led to GNNs emerging as one of the best-performing models in such tasks as recommendation systems [1], interaction and dynamics modeling in physics [2], protein interface and chemical reaction prediction [3], circuit design [4], and others. An important GNN trait is its typically small model size, but the number of computations they need to perform is very high and depends on the input graph size. For instance, the GIN model from Fig. 1 has merely 6,437 parameters, but on the relatively simple Reddit-Binary dataset it needs to perform up to 20 million Multiply-Accumulate (MAC) operations per graph. In contrast, the ResNet18 DNN [5] with 11.7 million parameters performs 1.8 billion MACs per image from the complex ImageNet dataset [6]. A comparison of these values shows that the GNN operations-to-parameters ratio is almost 20x higher than for a DNN. Additionally, GNN models have a significantly higher variety of operators [...]
Fig. 1. Experiments overview for Graph Isomorphism Network (GIN) and Reddit-Binary dataset. GINConv-2 between GINConv-1 and GINConv-3 is not
shown and is the same as GINConv-3. Blue arrows within the model indicate quantized data flow, while black ones represent 32-bit floating point (FP32).
Blue blocks representing Linear and ReLU layers are quantized, and black blocks remain FP32. Quantization-aware training of GNN is conducted first. Then,
the performance of quantized pre-trained GNNs with approximate multipliers during inference is evaluated. Finally, approximation- and quantization-aware
GNN training is executed and network performance is evaluated again.
[...] 3D convolutions, but any 3D map can also be represented as a graph, allowing the usage of GNNs. These factors encourage the development of new GNN models with a more energy-efficient inference. The extent of research in this field is substantial for DNNs, involving quantization [10], pruning and model compression [11], or approximate operations [12], [13]. However, improving the GNN inference efficiency is still in its infancy.

One of the most common ways to improve network inference efficiency is to perform operations at a reduced precision by quantizing the model's parameters and activations. Recent works have successfully applied various quantization-aware training techniques to GNNs [14], [15], [16], [17], [18], demonstrating their feasibility. Considering the already small GNN memory footprint, further reduction of the network size with lower-bit quantization is not paramount. In contrast, DNNs and especially generative pre-trained transformers (GPTs) for language models benefit from quantization immensely, allowing them to reduce their memory footprint from hundreds of gigabytes. However, it is highly beneficial to optimize calculations in all network types. Despite having a low number of parameters, GNNs need to perform an extremely high number of operations on graph data. Thus, approximation could be more important and impactful for certain GNN architectures and tasks. GNN approximation has not been explored before, and the wide variety of different GNN operators introduces additional challenges for a comprehensive GNN approximation study. Therefore, the effectiveness of approximation-aware training requires a dedicated evaluation for GNNs, and we analyzed the effect of approximate multiplications on a wide variety of popular GNN architectures and tasks. We unveiled that approximate multiplication in GNNs can be a higher-performing alternative to quantization. However, the large number of diverse GNN operators complicates the search for a generally applicable solution, since different operators do not behave in the same way under approximation.

Additionally, a framework for approximation simulation within GNN layers is yet to be introduced. To enable the research community to explore further approximation-aware training for GNNs and evaluate new potential techniques in this regard, we developed a GPU-accelerated PyTorch-based framework for approximation operation simulation to expedite the research. Existing frameworks for DNNs do not support GNN architectures, which led us to introduce our own implementation, published alongside this manuscript. Our framework opens up possibilities for further research on GNN approximation or on any approximate linear operations, which are widely used in transformers. We release our entire framework along with examples and, even though it is written in CUDA, it can be straightforwardly modified for custom approximate multipliers once the behavioral C code is available for these multipliers.

Our main contributions within this paper are as follows:

(1) We investigate for the first time approximation-aware training in the context of Graph Neural Networks (GNNs), their unique architectures and tasks. GNNs benefit from approximated higher bit-width operations, making approximation potentially more important for GNN performance optimization than quantization.

(2) We implemented GPU-based acceleration for approximation-aware GNN training with a bit-width-invariant approximation simulation framework that employs a custom CUDA kernel to realize approximate matrix multiplication. In contrast to most state-of-the-art methods, we do not rely on look-up tables, which quickly become unfeasible for higher bit-widths and cannot be used at all for approximate floating-point arithmetic.

(3) Our framework is released as open source with examples to enable the research community to explore approximation-aware training for GNNs and evaluate new potential techniques in this regard.

(4) We perform an energy efficiency and execution speed analysis of approximate GNNs in comparison with low-bit quantized GNNs. We also evaluate the trade-offs of approximate GNN accuracy against speed, power, and area gains on the accelerator level with various approximate multipliers.

Section II introduces the basics of quantized GNNs and approximate arithmetic circuits.
The GNN operators used in the paper and the methodology for approximation-aware training are given in Section III. Results and the experimental setup are discussed in Section IV, with a conclusion provided in Section V.

II. RELATED WORKS

In Section II-A, we introduce the basics of GNNs. Quantization methods applied to GNNs in previous works are covered in Section II-B. The background of approximate arithmetic circuits is given in Section II-C.

A. Graph Neural Networks

A GNN is a highly effective model for machine learning on graph-based data. This paper employs convolutional GNNs [19], which utilize convolution operators generalized to the graph domain. Spectral convolutions work with spectral representations of graphs and define the convolution operation in the spectral domain [20], [21]. Spatial convolutions utilize graph topology and define graph convolutions by information propagation [1], [22]. Message Passing (MP) outlines a general procedure for spatial-based convolutional GNNs. The core concepts of MP GNNs are message passing and node embedding. First, an embedding h_v of a node v is initialized from the node's input features x_v:

    h_v^{(0)} = x_v, \quad \forall v \in V    (1)

Then nodes u from the local neighborhood N(v) "message" their embedding information to the node v. This provides a more refined representation of the neighboring nodes. This information is aggregated and combined with the node's own embedding, generating the node embedding h_v^{(k)} for iteration k:

    h_v^{(k)} = \mathrm{AGG}_k\big(h_v^{(k-1)}, \{h_u^{(k-1)} : u \in N(v)\}\big)    (2)

where 1 ≤ k ≤ K. Next, it is passed through an update function, which is usually represented by a normal NN operation, such as a linear layer W · x + b with weight matrix W and bias b. This process is repeated K times, depending on the GNN depth. In other words, nodes are mapped to a multidimensional embedding space in such a way that similar nodes will be closer to each other within this space. Aggregation or update methods, as well as their order, can vary, depending on the GNN type.
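To make the message-passing procedure of Eqs. (1) and (2) concrete, the following minimal sketch performs one propagation step with a sum aggregation and a linear update. It is our illustration only: the adjacency-list representation, the sum aggregation, and all function names are assumptions, not code from the released framework.

```python
import torch
import torch.nn as nn

def message_passing_step(h, neighbors, update):
    """One MP iteration: aggregate neighbor embeddings (Eq. 2),
    then apply a learnable update, e.g., a linear layer W*x + b."""
    h_next = []
    for v in range(h.size(0)):
        if neighbors[v]:
            agg = h[neighbors[v]].sum(dim=0)        # sum over N(v)
        else:
            agg = torch.zeros_like(h[v])            # isolated node
        h_next.append(update(torch.cat([h[v], agg])))
    return torch.stack(h_next)

# h^(0) is initialized from the node features x_v (Eq. 1).
x = torch.randn(5, 16)                              # 5 nodes, 16 features
neighbors = [[1, 2], [0], [0, 3], [2, 4], [3]]      # adjacency lists
update = nn.Linear(2 * 16, 16)                      # update function
h1 = message_passing_step(x, neighbors, update)     # embeddings after k = 1
```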
method is to automate the redesign process and select parts
that need to be modified randomly or heuristically. Examples
B. Graph Neural Network Quantization of such approaches are SALSA [24], ALFANS [25], and var-
Quantization reduces the NN precision by mapping its ious Cartesian Genetic Programming (CGP) implementations
parameters to lower-bit representations, which curtails net- [26], [27].
work memory footprint and enables faster and more energy- Multiplication is one of the most common operations in
efficient low-bit computation. Quantization-Aware Training neural networks. Making it more power-efficient is crucial for
(QAT) proved to be the superior approach for NN quantization, high inference efficiency. Approximate multiplication for DNN
thus it was employed for GNNs as well. QAT simulates low- acceleration received a lot of attention in recent years [13], but
precision behavior in the forward pass, while normally keeping its impact on GNNs has not yet been explored. In this paper for
the backward pass in floating point (FP32) since approximated quantized GNNs, we employ approximators from EvoApprox-
gradients can significantly hinder the network’s capability to Lib [27], specifically, 8-bit signed multipliers obtained through
learn. This QAT approach is also called simulated quantiza- CGP and multi-objective evolutionary algorithm [28]. EvoAp-
tion because quantized values are dequantized (scaled back) to proxLib provides a variety of multipliers with gradually in-
the original range before performing any operation or gradient creasing mean relative error (MRE). It enables a comprehensive
computation. During training, network parameters and inputs analysis of their effect on GNN performance and comparison
All multipliers in EvoApproxLib have deterministic behavior, meaning that for the same input they give the same output. This property allows GNNs or other network types to learn the approximation behavior and mitigate or reduce its effect. In addition to hardware models in Verilog, EvoApproxLib provides behavioral models of all their multipliers in C code, which is mandatory for our framework, since we employ these behavioral models in our CUDA kernel. To demonstrate the applicability of our framework to other approximate multiplier types, we also perform approximation-aware training on a Dynamic Range Approximate Logarithmic Multiplier (DRALM) [29].
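For illustration, a behavioral model of an approximate multiplier is simply a deterministic scalar function of its integer operands. The sketch below is a hypothetical truncation-based 8-bit signed multiplier written in Python; it is not one of the EvoApproxLib or DRALM designs, and the function name and the drop_bits parameter are our own.

```python
def approx_mul8s_trunc(a: int, b: int, drop_bits: int = 3) -> int:
    """Hypothetical behavioral model of a signed 8-bit approximate multiplier:
    the lowest `drop_bits` bits of each operand are zeroed before an exact
    product. Like the library multipliers, it is deterministic, so identical
    inputs always produce identical outputs, and its mean relative error
    grows as more bits are dropped."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    mask = ~((1 << drop_bits) - 1)   # e.g., drop_bits=3 keeps bits 7..3
    return (a & mask) * (b & mask)
```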
III. METHODOLOGY FOR QUANTIZATION- AND APPROXIMATION-AWARE GNN TRAINING

In Section III-A, we describe uniform quantization, which maps 32-bit floating point (FP32) values to a lower-bit range, such as 8 or 4 bits. Its implementation and GNN-specific adjustments are also discussed. Uniform quantization is applied to the GNN architectures defined in Section III-B. Quantization- and approximation-aware training is performed for all GNNs and various datasets. The approximation-aware training methodology and the experimental framework used are presented in Section III-C. The experiment workflow is detailed at the beginning of Section IV and summarized in Fig. 1.

A. Uniform Quantization

GNN quantization involves mapping the weights and activations to a lower bit-width. Network weights are often closely distributed around zero. The Rectified Linear Unit (ReLU) is one of the most common and widely-used activation functions, and its output is always positive. Thus, it is crucial to find a proper quantization approach to reflect this behavior of network weights and activations. In our work, we employ uniform quantization, which uses the following function to map FP32 values w_i into low-precision quantized ones:

    Q(w_i) = \min\left(Q_{max}, \max\left(Q_{min}, \mathrm{round}\left(\frac{w_i}{S}\right) + Z\right)\right)    (3)

where S is a scaling factor, Z is a zero-point, and Q_{min} and Q_{max} are the minimum and maximum values representable in a given bit-width. Setting the zero-point Z = 0 makes zero exactly represented in lower precision, resulting in Q(0) = 0, since Q_{min} ≤ 0. The scaling factor S divides a given range of FP32 values w_i ∈ [w_{min}, w_{max}] into several partitions and is calculated as follows:

    S = \frac{w_{max} - w_{min}}{Q_{max} - Q_{min}}    (4)

If Q_{min} = 0 and Z = 0, Eq. (3) produces only non-negative quantized values, which is well suited for the ReLU activation. When Q_{min} = -Q_{max} and Z = 0, quantization becomes symmetric around zero, so it can be used to map network weights into a low-precision domain. To simulate quantization behavior in arithmetic operations, the values Q(w_i) are dequantized:

    \hat{w}_i = S\,(Q(w_i) - Z)    (5)

Dequantization also enables precise gradient computation with a Straight-Through Estimator (STE) [30].

Additionally, we employ Degree-Quant with percentile tracking of quantization ranges [14], which utilizes a probability mask to stochastically disable quantization for nodes based on their degree. In our work, only message passing is mask-aware, since the authors of the method stated that in graph convolutions stochastic quantization only marginally increases the performance and adds to the complexity of quantization-aware training. Since approximate multiplication needs to be performed on low-bit values, we do not dequantize them for multiplication or addition, but perform these operations with the original quantized values and dequantize the result. Nonetheless, we still use STE for back-propagation.
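As a minimal PyTorch sketch of Eqs. (3)–(5), the helpers below quantize a tensor for a given integer range and scale it back for the simulated-quantization forward pass. The function and variable names are ours and do not reflect the released framework's API.

```python
import torch

def quantize(w, q_min, q_max, zero_point=0):
    """Uniform quantization, Eqs. (3)-(4): derive the scale from the
    observed FP32 range, then scale, round, shift, and clamp."""
    scale = (w.max() - w.min()) / (q_max - q_min)        # Eq. (4)
    q = torch.round(w / scale) + zero_point              # inner part of Eq. (3)
    return torch.clamp(q, q_min, q_max), scale

def dequantize(q, scale, zero_point=0):
    """Eq. (5): map the quantized integers back to the FP32 range."""
    return scale * (q - zero_point)

# Signed 8-bit weights, symmetric around zero with Z = 0.
w = torch.randn(64, 32)
q_w, s_w = quantize(w, q_min=-127, q_max=127)
w_hat = dequantize(q_w, s_w)   # used in the simulated-quantization forward pass
```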
B. Graph Neural Network Operators

In this paper, we employ the four most common GNN operators, which often act as an initial baseline for more advanced operators.

Graph Convolutional Network (GCN) [31] is one of the earliest approaches able to generalize convolutions to the graph domain. Its node v embedding update h_v for any layer k is given by:

    h_v^{(k)} = \Theta^{\top} \sum_{u \in N(v) \cup \{v\}} \frac{e_{u,v}}{\sqrt{\hat{d}_u \hat{d}_v}} \, h_u^{(k-1)}    (6)

    \hat{d}_v = 1 + \sum_{u \in N(v)} e_{u,v}    (7)

where Θ is a linear transformation, e is an edge weight, and N(v) is the local neighborhood of the node v. GCN is capable of encoding graph structure and node features, which is useful for various graph- and node-level tasks.

The Graph Sample and Aggregate (SAGE) framework [1] introduces inductive learning on large graphs with sampling, where the node v embedding h_v for layer k is denoted as:

    h_v^{(k)} = \mathrm{ReLU}\big(W^{(k)} \cdot \mathrm{CONCAT}(h_v^{(k-1)}, h_{N(v)}^{(k)})\big)    (8)

    h_{N(v)}^{(k)} = \mathrm{AGG}_k\big(h_u^{(k-1)}, \forall u \in N(v)\big)    (9)

where W^{(k)} is a tensor of learnable weights, h_v^{(k-1)} is the node embedding in the previous iteration (graph conv-layer), and AGG is a max, sum, or mean aggregation of node embeddings from the local neighborhood N. For learning the weight matrices W^{(k)}, the authors of the original paper also propose a graph-based loss function that encourages nearby nodes to have similar embeddings, while forcing embeddings of distant nodes to be distinct.

The Graph Isomorphism Network (GIN) [22] node v embedding update h_v for any layer k can be described with the following formula:

    h_v^{(k)} = \mathrm{MLP}^{(k)}\Big((1 + \epsilon^{(k)}) \cdot h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)}\Big)    (10)

where MLP is a multi-layer perceptron with non-linearity, \epsilon^{(k)} is a learnable parameter, h_v^{(k-1)} is the node embedding in the previous iteration, and N is the local neighborhood of the node v. Notably, \epsilon^{(k)} denotes the importance of node v relative to its neighbors and can be set to zero to represent equal importance. The MLP usually consists of multiple layers, since the original paper's authors stated that one layer is insufficient. In addition, a unique approach to produce better graph embeddings is also proposed. It involves the summation of all node embeddings for each layer and the concatenation of the result. This solution combines the great expressiveness of the sum operator with the ability to preserve information about embeddings from previous layers through concatenation.

Graph Attention Network (GAT) [32] leverages masked self-attentional layers to address the shortcomings of graph convolution methods. The GAT node v embedding update h_v for any layer k is denoted as:

    h_v^{(k)} = \alpha_{i,i} \Theta h_v^{(k-1)} + \sum_{u \in N(v)} \alpha_{i,j} \Theta h_u^{(k-1)}    (11)

    \alpha_{i,j} = \frac{\exp\big(\mathrm{ACT}(a^{\top} [\Theta h_v^{(k-1)} \,\|\, \Theta h_u^{(k-1)}])\big)}{\sum_{n \in N(v) \cup \{v\}} \exp\big(\mathrm{ACT}(a^{\top} [\Theta h_v^{(k-1)} \,\|\, \Theta h_n^{(k-1)}])\big)}    (12)

where Θ is a linear transformation and α is an attention coefficient, which assigns different importance to different nodes within a neighborhood. a in Eq. (12) is an attention weight vector and ACT is a LeakyReLU activation function. The attention mechanism is computationally efficient and does not depend on knowing the entire graph structure upfront, enabling better evaluation results on completely unseen graphs.
C. Approximation-Aware Training Matrix Multiplication (GEMM), where each GPU thread loads
two tiles: a row from matrix A and a column from matrix B.
Approximation-Aware Training (AAT) follows a similar Then, each thread calculates one output element for the result of
logic as Quantization-Aware Training (QAT). It simulates an A · B. In contrast to previously mentioned DNN frameworks,
approximate operation behavior in the forward path during we do not utilize LUTs to generate the output of approximate
the training process. During AAT regular arithmetic operations multiplication. Instead, the approximate multiplication function
are replaced with their approximate variants. In the case of is applied directly to the elements of matrices A and B inside
8-bit approximate multipliers, values also need to be cast in an the kernel. Thus, we do not need to preliminary generate LUT,
integer format before multiplication. This simulation introduces store it in GPU memory, and take into consideration LUT
approximation errors, which become a part of a training loss scaling for various bit-widths.
and an optimizer tries to minimize them. Thus, AAT helps to To evaluate the speed of our framework, we measure the
learn deterministic approximation behavior and tries to mitigate total training time required for quantized or quantized and ap-
its effect on network accuracy. Retraining DNNs with approx- proximated networks with the same set of training parameters
imation errors proved to be a very reliable approach to boost and data. We also compare the execution time of our approxi-
approximate DNN performance [12], [13]. However, it extends mate matrix multiplication with a regular torch.mm function. In
the training time, since all approximation operations need to both of those experiments, we did not observe any noticeable
be simulated, which could be devastating for high-dimensional time overhead for approximate multiplication. Unfortunately,
GNN inputs. PyTorch CUDA kernels are not open-sourced, so we are limited
Compared to QAT, AAT introduces a much higher compu- to a comparison on a higher level. Since our goal was to enable
tational load for simulation. Matrix multiplication A · B is the GPU-accelerated approximate matrix multiplication for GNN
most common operation during network inference. Assuming training, we considered the total GNN training time a viable
A is a m × n matrix and B is a n × p matrix, any element of metric for the CUDA kernel performance evaluation.
C = A · B is a summarized element-wise multiplication of the One of the major difficulties for AAT is that the gradient of
matrix A row i with the column j of the matrix B: the approximate or quantization operation cannot be derived. A
n neural network needs to know what operations are performed
cij = aik bkj (13) on its input data. This knowledge is obtained through the usage
k=1 of differentiable functions within the network. Knowing the
Thus, in order to calculate one element of C, n number of gradients of all operations within a neural network is manda-
multiplications needs to be done. PyTorch [36] has very fast tory during the training process since they are required for
Similar to quantization, approximate multiplication is not a differentiable function. To estimate the gradient of a quantization operation, a straight-through estimator [30] is used, meaning that the gradient output of the quantization function equals its gradient input. However, this approach is not suitable for approximate multiplication, since there is an underlying principle behind it, which is a regular multiplication. Thus, in our experiments, we used the gradient of regular matrix multiplication to estimate the approximate one, which is a common practice for DNNs. Nonetheless, with an increase in the mean relative error, the estimated gradient values start to deviate significantly from the values that the approximate multiplication is supposed to produce. This results in more incorrect parameter updates, which hinders the network's capability to learn and converge to the best solution. Using other approaches to estimate the gradient of an approximate multiplication can potentially improve the result, for example, modeling approximate multiplication as noise [37].
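A compact way to express this gradient substitution in PyTorch is a custom autograd function whose forward pass runs the approximate multiplication while the backward pass returns the gradients of exact matrix multiplication. The sketch below illustrates the idea under the assumption that the approximate result is returned as a (dequantized) floating-point tensor; it is not the released CUDA-backed implementation.

```python
import torch

class ApproxMatMul(torch.autograd.Function):
    """Forward: simulated approximate matmul. Backward: gradient of the
    exact product C = A @ B, used as an estimate (see discussion above)."""

    @staticmethod
    def forward(ctx, A, B, approx_matmul):
        ctx.save_for_backward(A, B)
        # approx_matmul is any callable simulating the approximate multiplier,
        # e.g., the reference function from the previous sketch or a kernel.
        return approx_matmul(A, B)

    @staticmethod
    def backward(ctx, grad_out):
        A, B = ctx.saved_tensors
        grad_A = grad_out @ B.t()     # dL/dA for exact matmul
        grad_B = A.t() @ grad_out     # dL/dB for exact matmul
        return grad_A, grad_B, None   # no gradient w.r.t. the multiplier itself

# Usage inside a layer: C = ApproxMatMul.apply(A, B, my_approx_matmul)
```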
IV. EXPERIMENTAL RESULTS

For GNN training and evaluation we employ the PyTorch Geometric framework [7] with modified network modules and an integrated CUDA kernel for approximate matrix multiplication simulation. To minimize the effect of biases on GNN accuracy, they were disabled everywhere except the last output layers. We also do not train the parameter \epsilon^{(k)} for GIN (see Eq. (10)) and do not normalize embeddings for SAGE (see Eq. (8)). This is done to reduce the influence of these non-quantizable operations on network accuracy. Fig. 1 gives an overview of the experimental setup for GIN and the Reddit-Binary dataset, as an example. First, we perform quantization-aware training and present experimental results for GNNs quantized to different bit-widths on the Reddit-Binary, Collab [38], Cora, and PubMed [39] datasets. All networks have similar structures for the corresponding two dataset groups: a) three Conv layers with 32 hidden parameters, followed by two linear layers, for Reddit-Binary and Collab (graph classification); b) two Conv layers with 64 and 128 hidden parameters for Cora and PubMed respectively (node classification). We quantize both weights and activations, noted as W_{bW}A_{bA}, where bW and bA are the corresponding bit-widths (e.g., W8A7). In addition to quantized weights and activations, GNNs also require node feature quantization. In our experiments, node features are quantized in the same way as weights, thus they are omitted from the short-form quantization description. For the Reddit-Binary and Collab datasets, we do not quantize the inputs and first layers, since doing so lowers accuracy by a significant margin compared to the FP32 baseline, making quantized GNN performance unfeasible. An input layer is the beginning of the data flow. Thus, it is one of the most sensitive layers within any network, meaning that it has reduced tolerance towards any form of precision reduction. The original Degree-Quant quantization [14] leaves input and output layers in full precision for the Reddit-Binary dataset to avoid a significant GNN accuracy drop when operating on nodes with a single scalar as a feature. However, we minimized the number of non-quantized operations by leaving the output layers quantized. For node classification datasets (Cora and PubMed), GNNs are fully quantized.

In the next step, the robustness towards approximation errors during inference is evaluated for pre-trained 8-bit GNNs. Approximate multiplications are performed in all quantized linear operations W · x + b. Then, approximation-aware training is conducted to boost GNN accuracy with approximate multipliers. All GNN training is done on NVIDIA A100 GPUs. Multiple training runs are conducted for each experiment and the best accuracy is reported for the respective GNNs and datasets. Next, we evaluate the accuracy-to-energy and accuracy-to-delay efficiency of approximate and quantized GNNs. Quantized and approximated GNNs have the same architecture in all experiments, meaning that the same layers remain FP32 in both cases, so their energy contribution does not affect the evaluation. In our experiments with graph classification tasks, the percentage of FP32 operations is extremely low, reaching only 0.6% for GIN and 1.5% for SAGE, for example. However, there are multiple factors that affect the percentage of FP32 operations in partially quantized or approximated GNNs. These factors are the GNN architecture, the number of hidden parameters, and the number of features and classes in a dataset. More details on the influence of these factors will be given later in Section IV-A. Finally, we generate systolic MAC arrays with approximate multipliers and analyze the accelerator-level energy, power, and area efficiency of approximate multiplication.

A. Quantization-Aware GNN

Starting with model compression through quantization, weights and activations are quantized with bW and bA = bW − 1 bit-widths, respectively. We employ signed approximate multipliers, thus GNN weights are transformed to the signed 8-bit format. However, GNN ReLU activations are always positive and layer inputs would have an unsigned 8-bit format. Therefore, arithmetic operations in layers would be conducted with two different data types, which is not supported by the used approximate multipliers. To overcome this limitation, we quantize activations (i.e., layer inputs) to 7-bit, bringing the activation values into the signed 8-bit range. For consistency, we extrapolated this approach to other bit-widths, meaning that multiplications are done in a signed n-bit format.

All quantization-aware training is done from scratch and, based on the learning curve in Fig. 2, QAT does not impose additional difficulty on the GNN training process. Accuracy results for quantized GNNs on graph- and node-level tasks are given in Table I and Table II respectively, where they can be compared to the corresponding full-precision FP32 networks. The results demonstrate that the accuracy drop is non-linear and becomes more noticeable starting from W6A5 in most graph-level tasks. On the contrary, GNNs for node-level tasks demonstrate significant resilience towards quantization, which can be explained by the relative simplicity of the datasets in Table II compared to the ones in Table I. In general, preserving GNN accuracy appears to be harder for lower bit-widths since low-bit values represent model parameters and activations with highly reduced precision. Thus, the GNN's ability to learn deep representations of graph data is significantly hindered.
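As a small illustration of the bit-width scheme described above (signed 8-bit weights, 7-bit activations so that the non-negative ReLU outputs stay inside the signed 8-bit operand range of the multipliers), the integer ranges can be chosen as follows; the helper name is ours and only serves as an example.

```python
def q_range(bits: int, signed: bool):
    """Integer range for a given bit-width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

w_qmin, w_qmax = q_range(8, signed=True)    # weights: (-128, 127)
a_qmin, a_qmax = q_range(7, signed=False)   # ReLU activations: (0, 127)
# Both ranges fit the signed 8-bit operands expected by the approximate
# multipliers, so weight-activation products use a single data type.
```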
TABLE I
ACCURACY OF QUANTIZED GNNS, QUANTIZED WEIGHTS AND ACTIVATIONS. GRAPH CLASSIFICATION TASK: REDDIT-BINARY (RB) AND
COLLAB (COL) DATASETS
TABLE II
ACCURACY OF QUANTIZED GNNS, QUANTIZED WEIGHTS AND ACTIVATIONS. NODE CLASSIFICATION TASK: CORA AND PUBMED (PM) DATASETS
TABLE III
ACCURACY OF QUANTIZED GNNS WITH APPROXIMATE MULTIPLIERS DURING INFERENCE, QUANTIZATION - W8A7. DATASETS:
REDDIT-BINARY (RB) AND CORA
APX Type MRE (%) GIN(RB) SAGE(RB) GCN(RB) GIN(Cora) SAGE(Cora) GCN(Cora) GAT(Cora)
w/o AxC 0.00 0.945 0.930 0.825 0.799 0.773 0.826 0.828
1KV8 0.28 0.940 0.920 0.825 0.794 0.772 0.826 0.828
1KV9 0.90 0.925 0.925 0.820 0.788 0.774 0.826 0.828
1KVP 2.73 0.920 0.910 0.805 0.760 0.767 0.826 0.828
1L2J 4.41 0.880 0.910 0.795 0.759 0.760 0.826 0.827
1L2L 12.26 0.505 0.800 0.570 0.477 0.750 0.826 0.823
1L2N 27.44 0.500 0.515 0.500 0.146 0.711 0.823 0.815
1L12 135.77 0.500 0.500 0.500 0.130 0.200 0.742 0.587
B. Quantization-Aware and Approximation-Aware GNN

To investigate the effect of approximate 8-bit signed multipliers on GNNs, they were applied to all quantized linear operators of all W8A7 networks from Table I and Table II. Corresponding results for some datasets can be found in Table III, where multipliers are noted with their EvoApproxLib name and respective Mean Relative Error (MRE) in percentage. Only some cases are shown since the general tendency is preserved in all tasks. Approximate matrix multiplication within GNNs is performed with a CUDA kernel, where a specific approximator can be selected. The 1KV6 multiplier represents the baseline W8A7 accuracy since it introduces no approximation into the calculation. Thus, it is noted as "w/o AxC". Table V demonstrates that the error of approximate multipliers increases while their power consumption goes down. Individual multiplication errors accumulate in the matrix multiplication output, resulting in increased GNN prediction error, which is also reflected in Table III. The effect of approximate multiplications on GNN inference can be noticed immediately with harder datasets. Accuracy starts to decrease non-linearly upon the introduction of any sort of approximation. For example, the 0.9% MRE of 1KV9 only lowers accuracy by 0.001-0.02 in all cases, but the 1L2J multiplier with 4.41% MRE starts to introduce an accuracy decrease of up to 0.065. Finally, the most inaccurate multipliers, represented by 1L2N and 1L12 with 27.44% and 135.77% MRE correspondingly, drive GNNs into a completely unfeasible state. Quantization-aware GNNs cannot tolerate any form of approximation for tasks with higher complexity, since pre-trained networks are not aware of the approximate multiplication behavior within the model.

To improve GNN performance under approximation, we perform Approximation-Aware Training (AAT), teaching networks to counter the multiplication error influence. For this task, we employ custom GNN layers with the previously mentioned approximate CUDA matrix multiplication kernel. The gradient for approximation-aware matrix multiplication in the backward pass is substituted by the gradient of regular matrix multiplication, since gradients of approximate functions are not defined. We train GNNs with the four most energy-efficient approximators to test the limits of the error mitigation capabilities of AAT. Training is done for both network types and the Reddit-Binary and Cora datasets. The GNN learning curve during the training process is shown in Fig. 2. AAT does not affect the GNN convergence rate and does not impose any training time overhead over QAT. GNNs with the 1L12 approximator did not converge to any reasonable solution. Considering the previous experiment (see Table III), the 135.77% MRE caused by this approximator appears to be too high to maintain reasonable network performance. Results for the other three approximators and the Reddit-Binary dataset with GIN and SAGE are shown in Fig. 3(a)–3(b). Results for other GNNs and the Collab dataset are not shown, since they follow the same pattern, demonstrating that approximation-aware training helped GNNs to adapt to the inexact multiplication outcomes caused by the approximators. We consider it adaptation in most cases because the networks achieve the highest performance with the approximators they were trained with (highest peaks in Fig. 3a–3b). The only case where the GNN was able to generalize to all approximators is SAGE with a 1L2J multiplier. Approximation-aware training with the 1L2J multiplier completely mitigated the 4.41% MRE influence for GIN and improved SAGE accuracy by 0.015. GIN, trained with a 1L2L approximate multiplier, experienced a significant performance boost for 12.26% MRE, improving its accuracy by 0.43. SAGE also demonstrated an accuracy increase of 0.11 with the same approximate multiplier. Finally, 1L2N approximation-aware training helped GIN and SAGE to recover their accuracy for 27.44% MRE by 0.415 and 0.36 correspondingly.

Approximation-aware training for the node-level task on the Cora dataset affects networks in a completely different way (see Fig. 3c–3d). As an example, we use the GIN and SAGE GNNs on the Cora dataset since the approximation effect is more noticeable in this case. However, other GNNs on the Cora dataset and all GNNs on the PubMed dataset behave in a similar way. While GNNs still try to adapt to a specific multiplier, especially for higher MREs, the magnitude of the accuracy difference with other unknown-to-the-network multipliers is very low compared to the graph-level task. This behavior indicates that GNNs are instead trying to generalize to multiplication errors, since their effect was not significant for node-level tasks in the first place (see Table III). However, networks still experienced an accuracy increase in the case of the 1L2N multiplier with a high 27.44% MRE. One of the reasons for the poor AAT performance on the Cora dataset could be the GNN architectures, which in this case are composed of only two consecutive graph-conv layers with 64 parameters. Thus, the low number of network parameters is insufficient to memorize the dataset and the approximation patterns altogether. This statement is supported by an experiment for the Cora dataset with the same GNN architectures but with 512 hidden parameters, i.e., neurons in every linear layer.
Fig. 3. Accuracy of 8-bit GIN (left) and SAGE (right) with approximate multipliers and Approximation-Aware Training (AAT). Datasets: Reddit-Binary
(top), Cora (bottom). Plot styles represent different approximate multipliers, which were used during AAT.
The 8-bit GIN and SAGE with 512 hidden parameters reached 0.799 and 0.802 accuracy, respectively. The same networks, trained with the 27.44% MRE multiplier (1L2N), were able to achieve 0.792 and 0.797 accuracy respectively, showing a smaller deviation from the baseline 8-bit GNNs. This result strengthens the point that networks may require a sufficient number of parameters to memorize an approximation behavior, which increases the number of computations and the overall energy requirement. SAGE with 512 neurons in its linear layers is also able to reach the same accuracy as GIN, indicating that SAGE is more sensitive to the number of hidden parameters than GIN, since GINConv employs two consecutive linear layers instead of one (see Fig. 1). Note that better AAT performance with a higher number of GNN parameters is the expected behavior. Increasing the number of network parameters raises their redundancy, which makes networks more resistant to any form of errors, i.e., quantization or approximation ones.

Even though in our experiments we employed multipliers generated through Cartesian genetic programming, the GNN AAT approach can be extended to other multiplier types as well, assuming they produce a deterministic output. One of the major qualities that affects the accuracy of a GNN is the Mean Relative Error (MRE) of the approximate multiplier, which was observed with the EvoApproxLib multipliers. To strengthen this point and demonstrate the applicability of GNN AAT to other multiplier types, we conduct new additional experiments with a Dynamic Range Approximate Logarithmic Multiplier (DRALM) [29], because it has been explored in the context of DNNs before [40]. The MRE of the DRALM multiplier is determined by the truncated width t, with a higher t leading to a lower MRE but higher energy costs. Our experiment is conducted with the GIN and SAGE GNN architectures on the Reddit-Binary and Cora datasets, with results reported in Fig. 4. All GNNs are able to converge to an optimal solution, demonstrating the highest accuracy starting from t = 4, which is in line with the error tendencies reported in the DRALM paper.

Due to the fundamental differences between GNNs and DNNs, such as the processed data type and architecture, a direct comparison of GNN and DNN AAT would not be informative or even possible. Graph data and datasets cannot be used with regular DNNs, meaning that tasks cannot be compared directly. However, we can compare to a certain degree how much accuracy can be recuperated with AAT in DNN and GNN tasks. For example, the same approach with AAT was used in [12], where an 8-bit approximate DNN with a 91% multiplication power reduction experienced an accuracy degradation of 2.8% and 4.5% on the MNIST and SVHN datasets correspondingly. In our experiments, the highest multiplication power reduction we use with AAT is the 1L2N multiplier with 78.5%, and the accuracy degradation varies depending on the dataset and the GNN model. For the Cora dataset, it ranges from 1.3% to 2%, and for Reddit-Binary from 3.5% to 5%. However, with the 1L12 multiplier and its 87.1% multiplication energy reduction, GNNs were not able to converge to a reasonable solution. Nevertheless, different DNNs with the CIFAR10 dataset demonstrated up to 7.5% accuracy degradation for a slightly more than 40% energy reduction [41].
Fig. 4. Accuracy of 8-bit GIN (left) and SAGE (right) with Dynamic Range Approximate Logarithmic Multiplier (DRALM) and Approximation-Aware
Training (AAT). Datasets: Reddit-Binary (top), Cora (bottom).
TABLE IV
DELAY, AREA AND POWER REQUIREMENTS FOR MULTIPLICATION

Bit-Width   Delay (ns)   Area (μm²)   Power (mW)
FP32        0.589        780.09       1.6930
8-bit       0.384        172.47       0.2919
7-bit       0.328        110.74       0.1696
6-bit       0.266        83.76        0.1254
5-bit       0.232        47.82        0.0620
4-bit       0.159        28.21        0.0315

Thus, the efficiency of AAT greatly depends on the task and network model. However, it is safe to say that AAT is a viable method to prevent significant accuracy degradation in approximate GNNs. Unfortunately, it is not possible to evaluate the acceptable accuracy drop quantitatively. Ideally, approximation or quantization should provide pure benefits without any accuracy loss, but this is generally not achievable for higher approximation errors and lower bits. The decision on how much accuracy can be sacrificed depends on the task. For example, fine-tuned quantized language models [42] experience some performance drop based on metrics, but are still capable of replicating the text generation capabilities of full-precision models. Thus, the acceptable accuracy drop is selected based on how tolerable it is for the specific task and how beneficial the overall energy savings and inference speed-up are.

C. Energy Efficiency of Quantization-Aware and Approximation-Aware GNN

To estimate the energy savings that are obtained with approximate multiplier circuits, we synthesize, evaluate, and compare different multiplier designs. The high-level (RTL) designs of the approximate multipliers are obtained from EvoApproxLib [27]. The FP32 single-precision multiplier is a parallel implementation based on the IEEE 754 standard. The 4- to 8-bit accurate multipliers are implementations derived from the EvoApproxLib 1KV6 baseline 8-bit signed multiplier. As the underlying technology node, we employ 14 nm FinFET, fully calibrated against 14 nm Intel measurements [43] on the industry-standard FinFET transistor compact model BSIM-CMG [44]. The standard cell designs, including post-layout parasitics, are obtained from the 15 nm NanGate open cell library [45]. Standard cell library characterization, circuit synthesis, and static timing analysis (STA) are all carried out with the commercial tool flow from Synopsys.

During synthesis, all designs are optimized for maximum performance, resulting in individual critical path delays, and thus, different maximum sustainable frequencies for each design. However, for comparable results, the power consumption of each multiplier is evaluated at a fixed frequency of 1 GHz, which can be sustained by all synthesized designs (i.e., circuit delay ≤ 1 ns). The total power consumption is evaluated by the commercial STA tool and includes both static (leakage) and dynamic (switching) power consumption. The switching power consumption is estimated by statistical switching probabilities, which are also calculated by the commercial tool flow. Results of the evaluation are shown in Table IV for FP32 and accurate 4- to 8-bit multiplication, and in Table V for approximate 8-bit multiplication.

Even though FP32 remains the most commonly used format for GNNs, FP32 multiplication cannot compete with the lower-bit integer-only one in terms of energy efficiency and computation speed.
Fig. 5. Accuracy-to-energy (left) and accuracy-to-delay (right) comparison for quantization and 8-bit approximation. Reddit-Binary dataset with GIN (top)
and SAGE (bottom) GNN. A higher accuracy value indicates better accuracy-to-energy and accuracy-to-delay efficiency.
TABLE V
MRE, DELAY, AREA AND POWER REQUIREMENTS FOR APPROXIMATE 8-BIT MULTIPLICATION

APX Type   MRE (%)   Delay (ns)   Area (μm²)   Power (mW)
w/o AxC    0.00      0.384        172.47       0.2919
1KV8       0.28      0.377        176.01       0.2922
1KV9       0.90      0.378        150.45       0.2555
1KVP       2.73      0.345        151.63       0.2517
1L2J       4.41      0.328        110.74       0.1701
1L2L       12.26     0.266        83.85        0.1255
1L2N       27.44     0.232        47.92        0.0629
1L12       135.77    0.152        32.98        0.0376

Table IV demonstrates that FP32 multiplication requires 5.8x more energy than the 8-bit one, while having a 1.5x higher delay and a 4.5x bigger circuit area. Lowering the bit-width leads to more energy-efficient and faster network inference. However, approximate 8-bit multiplications also experience a significant delay and energy reduction as the MRE increases, as reported in Table V. Thus, it is essential to determine which solution leads to faster and more energy-efficient GNN inference with the lowest accuracy penalty.

Regarding the energy efficiency evaluation, it is crucial to consider network accuracy as well, because the goal is to be as close as possible to the FP32 model. For the energy and delay evaluation, we employ the energy cost relative to 8-bit, Er_i = E_i / E_{8bit}, and the relative circuit delay, Dr_i = D_i / D_{8bit}. For the performance evaluation, the accuracy relative to FP32, Ar_i = A_i / A_{FP32}, is used. We demonstrate detailed energy and delay evaluation results for the GIN and SAGE GNNs on the Reddit-Binary dataset, but provide a summarized performance review for all GNNs and datasets later in Table VI.
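The relative metrics can be read directly off Tables IV and V. The short computation below uses the per-multiplication power at the fixed 1 GHz evaluation frequency as the energy proxy; the resulting figures are consistent with the 42% energy and 15% delay reductions quoted for the 1L2J multiplier in the following paragraph.

```python
# Power (mW) and delay (ns) copied from Tables IV and V.
power_mw = {'8-bit': 0.2919, '1L2J': 0.1701, '1L2L': 0.1255, '1L2N': 0.0629}
delay_ns = {'8-bit': 0.384,  '1L2J': 0.328,  '1L2L': 0.266,  '1L2N': 0.232}

for m in ('1L2J', '1L2L', '1L2N'):
    er = power_mw[m] / power_mw['8-bit']      # Er_i = E_i / E_8bit
    dr = delay_ns[m] / delay_ns['8-bit']      # Dr_i = D_i / D_8bit
    print(f"{m}: Er = {er:.2f} ({1 - er:.0%} less energy), Dr = {dr:.2f}")
# 1L2J: Er = 0.58 (42% less energy), Dr = 0.85
# 1L2L: Er = 0.43 (57% less energy), Dr = 0.69
# 1L2N: Er = 0.22 (78% less energy), Dr = 0.60
```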
Fig. 5(a)–5(b) shows the results for GIN and the Reddit-Binary dataset. Approximation-aware GIN with the 1L2J multiplier reaches the same performance as the regular 8-bit network while having a 42% lower energy cost and a 15% delay reduction. The 1L2L multiplier outperforms a 6-bit network by 0.03 with the same energy requirements and delay. Finally, the 1L2N multiplier is almost identical to the 5-bit multiplier energy- and delay-wise, but also achieves 0.02 better accuracy. Results for the SAGE network demonstrate a similar tendency for the Reddit-Binary dataset (see Fig. 5c–5d), where approximation-aware SAGE with the 1L2J multiplier has accuracy very close to the baseline 8-bit GNN. In the two other cases, the accuracy of approximation-aware SAGE shows a difference comparable to GIN with the corresponding lower-bit variants: a 0.02 accuracy increase for 1L2L vs. 6-bit, and 0.035 for 1L2N vs. 5-bit. Overall, 8-bit approximation outperforms quantization on the Reddit-Binary dataset for the same energy efficiency and delay reduction.

As mentioned previously, the used approximate GNN architectures do not demonstrate any performance increase over quantized ones for node-level tasks. Thus, the energy efficiency and delay of approximate GNNs are similar to or lower than those of GNNs quantized to lower bits. A summary of the AAT performance compared to lower-bit QAT is given in Table VI. GIN and SAGE demonstrate better or similar accuracy on the Reddit-Binary and Cora datasets respectively, but fall off on the more complex Collab and PubMed datasets. Among the tested GNNs, GCN is the least tolerant towards AAT, since the GCN operator has the lowest number of trainable parameters compared to the other GNNs. On the other hand, GAT with AAT demonstrated superior accuracy in all applicable cases, most likely due to an additional accuracy boost from the attention mechanism. Note that GCN accuracy after AAT is lower than without it, as per Table III. AAT introduces additional inaccuracies into the training process in the form of the gradient estimation for the approximate operation. As was mentioned before, this estimator is required to enable backpropagation during training. However, it may lead to complications in the network's ability to converge due to a difference between the estimated gradient and the real one associated with an approximate multiplier.
TABLE VI
8-BIT GNN AAT ACCURACY DIFFERENCE (IN %) COMPARED TO LOWER-BIT QAT WITH THE SAME
ENERGY PER MULTIPLICATION
Fig. 6. Energy, power, and area comparison of the generated 8×8 systolic multiply-accumulate arrays. Array synthesis is performed using a 14nm FinFET
technology calibrated with Intel measurements. Approximate multipliers are from the EvoApproxLib [27], “w/o AxC” denotes precise 8-bit multiplication
without approximate computations.
Thus, in cases where the original error does not affect the network's accuracy significantly, the benefits of AAT may not outweigh its drawbacks.

D. Accelerator-Level Benefits of Approximate Multiplication

According to EvoApproxLib [27], a precise signed 8-bit multiplication requires about 12 times more energy than a signed 8-bit addition. Even though in real-life scenarios accelerators will employ higher-bit adders for accumulation to avoid overflow when multiplying large matrices, it is still more crucial to optimize the multiplication operation. Considering the relatively low energy cost of addition and the poor potential energy gains for the network, we did not approximate it, to avoid the introduction of additional errors. However, adders and other necessary components of MAC units still contribute to the total system power requirements. Thus, it is essential to quantify the overall gain of approximate multiplication within an accelerator.

For our system-level evaluation, we synthesized 8×8 systolic multiply-accumulate (MAC) arrays using a framework capable of generating the hardware description of a full array, including different types of MAC units, in which the approximate multipliers from EvoApproxLib [27] are employed. The framework is automated to carry out register-transfer level (RTL) simulations to verify functionality and to perform synthesis using a 14 nm FinFET technology calibrated with Intel measurements, as for the approximate multipliers before. Gate-level simulations are also performed with 100 different matrices for an accurate power analysis. The systolic array design follows the non-stationary systolic array (NSA) architecture and, since the MACs are intended for matrix multiplication, we employ 19-bit signed adders to accumulate the results of the approximate matrix multiplications, which is sufficient for the used GNN architectures. Results of the energy, power, and area comparison at a fixed 1 GHz frequency are shown in Fig. 6. We excluded the 1L12 approximate multiplier since none of the GNNs converged to any reasonable
Results for the energy, power, and area comparison at a fixed 1 GHz frequency are shown in Fig. 6. We excluded the 1L12 approximate multiplier, since none of the GNNs converged to any reasonable solution with it. As expected, the highest energy, power, and area reduction can be observed for the 1L2N multiplier, which also induces the highest GNN accuracy penalty. For example, with a 17.7% power and a 24.5% area reduction, the 1L2N multiplier lowers the baseline 8-bit GIN and SAGE accuracy on the Reddit-Binary dataset by 3% and 5.5%, respectively (from 94.5% and 93%). In the same experimental setup, the 1L2L multiplier incurs 1% and 2% accuracy penalties while providing an 8.8% power and a 15.5% area reduction. Finally, the 1L2J multiplier does not lower the baseline accuracy for GIN and reduces it by only 0.5% for SAGE, giving a 6.7% power and a 5.4% area reduction for the accelerator. The other approximate multipliers shown in Fig. 6 do not incur any accuracy penalty while providing their respective gains.
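The accuracy penalties discussed above track how much numerical fidelity each multiplier gives up for its power and area savings, which is commonly summarized by the Mean Relative Error (MRE) of the multiplier. The following minimal sketch (our assumption of a common convention, not EvoApproxLib's own tooling) measures the MRE of a signed 8-bit multiplier by exhaustively enumerating all 256 × 256 operand pairs of its lookup table.

```python
import torch

def mean_relative_error(lut: torch.Tensor) -> float:
    """lut: (256, 256) table of approximate products, indexed by
    (a + 128, b + 128) for signed 8-bit operands a and b."""
    vals = torch.arange(-128, 128, dtype=torch.float64)
    exact = vals.view(-1, 1) * vals.view(1, -1)     # exact reference products
    err = (lut.to(torch.float64) - exact).abs()
    # The relative error is undefined for a zero exact product; excluding
    # these cases is a common convention when reporting MRE.
    mask = exact != 0
    return (err[mask] / exact[mask].abs()).mean().item()
```

For the exact placeholder table from the previous sketch this function returns 0; tables generated from approximate designs return nonzero MREs, and, as discussed in the conclusion below, larger MREs correspond to larger accuracy degradation.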
V. CONCLUSION

In this paper, we explored quantization-aware and approximation-aware GNNs and demonstrated that approximate multiplication is a viable method to improve GNN inference speed and energy efficiency. Approximation-aware GNN training with approximate multipliers has the potential to mitigate the influence of the induced errors completely, provided that the Mean Relative Error (MRE) introduced by the approximation is not high. Approximate multipliers with higher MREs can push inference speed and energy efficiency even further, but GNN accuracy then degrades with the magnitude of the errors. We presented an efficient method to accelerate the simulation of approximate multiplication within GNNs, which allowed us to perform approximation-aware training without noticeable time overhead. We also demonstrated that approximate multiplication in GNN inference can be a higher-performing alternative to lower-bit GNN quantization. However, the wide variety of GNN operators significantly complicates the search for a solution applicable to all possible GNN architectures. Thus, the choice between 8-bit approximate multiplication and lower-bit quantization for efficient GNN inference depends greatly on the GNN architecture and the task.
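As a complement to the summary above, the following minimal sketch (an illustrative assumption building on the lookup-table matmul sketched earlier, not the exact implementation of our framework) shows how approximation-aware training can be wired into automatic differentiation: the forward pass uses the simulated approximate product, while the backward pass applies a straight-through estimator, i.e., the gradients of the exact multiplication. Quantization is simplified to single per-tensor scales, and the names ApproxLinearSTE, s_x, and s_w are ours for illustration.

```python
import torch

class ApproxLinearSTE(torch.autograd.Function):
    """Forward: quantize to int8, multiply through the approximate lookup
    table, dequantize. Backward: straight-through estimator that reuses the
    gradients of the exact product."""

    @staticmethod
    def forward(ctx, x, w, lut, s_x, s_w):
        ctx.save_for_backward(x, w)
        x_q = torch.clamp((x / s_x).round(), -128, 127).to(torch.int8)
        w_q = torch.clamp((w / s_w).round(), -128, 127).to(torch.int8)
        # approx_matmul_int8 is the lookup-table simulation sketched earlier;
        # x: (B, in), w: (out, in) -> y: (B, out)
        y_int = approx_matmul_int8(x_q, w_q.t(), lut)
        return y_int.to(x.dtype) * (s_x * s_w)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        grad_x = grad_out @ w            # gradient of the exact matmul
        grad_w = grad_out.t() @ x
        return grad_x, grad_w, None, None, None

# Example call inside a layer (scales chosen here purely for illustration):
# y = ApproxLinearSTE.apply(x, weight, lut, 0.05, 0.02)
```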
ACKNOWLEDGMENT

The authors would like to thank Somaya Mansour and Paul Genssler for performing the analysis of the systolic array.

REFERENCES

[1] W. L. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Proc. NIPS, 2017, pp. 1025–1035.
[2] A. Sanchez-Gonzalez et al., "Graph networks as learnable physics engines for inference and control," in Proc. ICML, vol. 80, J. G. Dy and A. Krause, Eds., PMLR, 2018, pp. 4467–4476.
[3] K. Do, T. Tran, and S. Venkatesh, "Graph transformation policy network for chemical reaction prediction," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 750–760.
[4] G. Zhang, H. He, and D. Katabi, "Circuit-GNN: Graph neural networks for distributed circuit design," in Proc. Int. Conf. Mach. Learn., 2019, pp. 7364–7373.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2016, pp. 770–778.
[6] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[7] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric." GitHub. Accessed: Jul. 25, 2023. [Online]. Available: https://github.com/pyg-team/pytorch_geometric
[8] J. Halcrow, A. Mosoi, S. Ruth, and B. Perozzi, "Grale: Designing networks for graph learning," in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2020, pp. 2523–2532.
[9] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, "3D graph neural networks for RGBD semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), 2017, pp. 5209–5218.
[10] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, "A survey of quantization methods for efficient neural network inference," 2021, arXiv:2103.13630.
[11] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian, "Variational convolutional neural network pruning," in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit. (CVPR), 2019, pp. 2775–2784.
[12] V. Mrázek, S. S. Sarwar, L. Sekanina, Z. Vasícek, and K. Roy, "Design of power-efficient approximate multipliers for approximate artificial neural networks," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), 2016, pp. 1–7.
[13] G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel, "Hardware approximate techniques for deep neural network accelerators: A survey," ACM Comput. Surv., vol. 55, pp. 1–36, Mar. 2022.
[14] S. A. Tailor, J. Fernández-Marqués, and N. D. Lane, "Degree-Quant: Quantization-aware training for graph neural networks," 2020, arXiv:2008.05000.
[15] L. Zhao, Q. Wu, X. Wang, T. Tian, W. Wu, and X. Jin, "HuGraph: Acceleration of GCN training on heterogeneous FPGA clusters with quantization," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), 2022, pp. 1–7.
[16] Y. Wang, B. Feng, and Y. Ding, "QGTC: Accelerating quantized graph neural networks via GPU tensor core," in Proc. 27th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2021, pp. 107–119.
[17] B. Feng, Y. Wang, X. Li, S. Yang, X. Peng, and Y. Ding, "SGQuant: Squeezing the last bit on graph neural networks with specialized quantization," in Proc. IEEE 32nd Int. Conf. Tools Artif. Intell. (ICTAI), 2020, pp. 1044–1052.
[18] J. Wang, Y. Wang, Z. Yang, L. Yang, and Y. Guo, "Bi-GCN: Binary graph convolutional network," in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit. (CVPR), 2020, pp. 1561–1570.
[19] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, pp. 4–24, Jan. 2021.
[20] T. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," 2016, arXiv:1609.02907.
[21] C. Zhuang and Q. Ma, "Dual graph convolutional networks for graph-based semi-supervised classification," in Proc. World Wide Web Conf., 2018, pp. 499–508.
[22] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" 2018, arXiv:1810.00826.
[23] H. Jiang, F. J. H. Santiago, H. Mo, L. Liu, and J. Han, "Approximate arithmetic circuits: A survey, characterization, and recent applications," Proc. IEEE, vol. 108, no. 12, pp. 2108–2135, Dec. 2020.
[24] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. Roy, and A. Raghunathan, "SALSA: Systematic logic synthesis of approximate circuits," in Proc. DAC Des. Automat. Conf., 2012, pp. 796–801.
[25] Y. Wu and W. Qian, "ALFANS: Multilevel approximate logic synthesis framework by approximate node simplification," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 7, pp. 1470–1483, Jul. 2020.
[26] L. Sekanina, Z. Vasicek, and V. Mrazek, "Automated search-based functional approximation for digital circuits," in Approximate Circuits: Methodologies and CAD. New York, NY, USA: Springer-Verlag, 2019, pp. 175–203.
[27] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "The basic library of approximate circuits." EvoApproxLib. Accessed: Jul. 25, 2023. [Online]. Available: https://ehw.fit.vutbr.cz/evoapproxlib/
[28] V. Mrazek, Z. Vasicek, L. Sekanina, H. Jiang, and J. Han, "Scalable construction of approximate multipliers with formally guaranteed worst case error," IEEE Trans. Very Large Scale Integr., vol. 26, no. 11, pp. 2572–2576, Nov. 2018.
[29] P. Yin, C. Wang, H. Waris, W. Liu, Y. Han, and F. Lombardi, "Design and analysis of energy-efficient dynamic range approximate logarithmic multipliers for machine learning," IEEE Trans. Sustain. Comput., vol. 6, no. 4, pp. 612–625, Oct./Dec. 2021.
[30] Y. Bengio, N. Léonard, and A. C. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," 2013, arXiv:1308.3432.
[31] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. 5th Int. Conf. Learn. Representations (ICLR), Conf. Track Proc., Toulon, France, Apr. 24–26, 2017, arXiv:1609.02907.
[32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Proc. Int. Conf. Learn. Representations, 2018, arXiv:1710.10903.
[33] C. Parra, A. Guntoro, and A. Kumar, "ProxSim: GPU-based simulation framework for cross-layer approximate DNN optimization," in Proc. Des., Automat. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1193–1198.
[34] F. Vaverka, V. Mrazek, Z. Vasicek, and L. Sekanina, "TFApprox: Towards a fast emulation of DNN approximate hardware accelerators on GPU," in Proc. 23rd Conf. Des., Automat. Test Eur. (DATE), Mar. 2020, pp. 294–297.
[35] D. Danopoulos, G. Zervakis, K. Siozios, D. Soudris, and J. Henkel, "AdaPT: Fast emulation of approximate DNN accelerators in PyTorch," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 6, pp. 2074–2078, Jun. 2023.
[36] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst. 32, Curran Associates, Inc., 2019, pp. 8024–8035. Accessed: Jul. 25, 2023. [Online]. Available: https://pytorch.org/
[37] E. Trommer, B. Waschneck, and A. Kumar, "Combining gradients and probabilities for heterogeneous approximation of neural networks," in Proc. 41st IEEE/ACM Int. Conf. Comput.-Aided Des., Aug. 2022.
[38] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann, "TUDataset: A collection of benchmark datasets for learning with graphs," 2020, arXiv:2007.08663.
[39] Z. Yang, W. W. Cohen, and R. Salakhutdinov, "Revisiting semi-supervised learning with graph embeddings," 2016, arXiv:1603.08861.
[40] R. Pilipović, P. Bulić, and U. Lotrič, "A two-stage operand trimming approximate logarithmic multiplier," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 6, pp. 2535–2545, Jun. 2021.
[41] V. Mrazek, Z. Vasicek, L. Sekanina, M. Hanif, and M. Shafique, "ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 2019, pp. 1–8.
[42] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," 2023, arXiv:2305.14314.
[43] S. Natarajan et al., "A 14nm logic technology featuring 2nd-generation FinFET, air-gapped interconnects, self-aligned double patterning and a 0.0588 μm² SRAM cell size," in Proc. IEEE Int. Electron Devices Meeting, Piscataway, NJ, USA: IEEE Press, 2014, pp. 3–7.
[44] J. P. Duarte et al., "BSIM-CMG: Standard FinFET compact model for advanced circuit design," in Proc. ESSCIRC Conf. 41st Eur. Solid-State Circuits Conf. (ESSCIRC), Piscataway, NJ, USA: IEEE Press, 2015, pp. 196–201.
[45] M. Martins et al., "Open cell library in 15nm FreePDK technology," in Proc. Int. Symp. Physical Des., 2015, pp. 171–178.

Florian Klemme (Member, IEEE) received the B.Sc. degree in system integration from the University of Applied Sciences Bremerhaven, Germany, in 2014, and the M.Sc. degree in computer science from the Karlsruhe Institute of Technology, Germany, in 2018. He is currently working toward the Ph.D. degree with the Chair of Semiconductor Test and Reliability, University of Stuttgart. His research interests include cell library characterization and machine learning techniques in electronic design automation and computer-aided design, specifically toward the reliability of transistors and integrated circuits in advanced technology nodes.

Hussam Amrouch (Member, IEEE) received the Ph.D. degree with the highest distinction (summa cum laude) from KIT in 2015. He is a Professor heading the Chair of AI Processor Design, Technical University of Munich (TUM). Additionally, he is heading Brain-inspired Computing at the Munich Institute of Robotics and Machine Intelligence (MIRMI), Germany, and is also the Head of Semiconductor Test and Reliability (STAR), University of Stuttgart, Germany. Prior to that, he was a Research Group Leader at the Karlsruhe Institute of Technology (KIT), where he led the research efforts in building dependable embedded systems. He currently serves as an Editor for the Nature Scientific Reports journal. His main research interests include design for reliability and testing from device physics to systems, machine learning for CAD, HW security, approximate computing, and emerging technologies with a special focus on ferroelectric devices. He holds ten HiPEAC Paper Awards and three best paper nominations at top EDA conferences (DAC'16, DAC'17, and DATE'17) for his work on reliability. He has served on the technical program committees of many major EDA conferences, such as DAC, ASP-DAC, and ICCAD, and as a reviewer for many top journals, including Nature Electronics, IEEE TRANSACTIONS ON ELECTRON DEVICES, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, and IEEE TRANSACTIONS ON COMPUTERS. He has more than 245 publications in multidisciplinary research areas (including more than 100 journal articles and one in Nature Communications) across the entire computing stack, from semiconductor physics to circuit design all the way up to computer-aided design and computer architecture. His research in HW security and reliability has been funded by the German Research Foundation (DFG), Advantest Corporation, and the U.S. Office of Naval Research (ONR).