Content Augmented Graph Neural Networks

Fatemeh Gholamzadeh Nasrabadi, AmirHossein Kashani, Pegah Zahedi, and Mostafa Haghir Chehreghani (corresponding author)

Department of Computer Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran

arXiv:2311.12741v1 [cs.LG] 21 Nov 2023

Abstract
In recent years, graph neural networks (GNNs) have become a popular tool for solving various problems over graphs. In these models, the link structure of the graph is typically exploited and nodes' embeddings are iteratively updated based on adjacent nodes. Nodes' contents are used solely in the form of feature vectors, serving as nodes' first-layer embeddings. However, the filters or convolutions applied to these initial embeddings across iterations/layers diminish their impact, so that they contribute insignificantly to the final embeddings. To address this issue, in this paper we propose augmenting nodes' embeddings with embeddings generated from their content, at higher GNN layers. More precisely, we propose models wherein a structural embedding using a GNN and a content embedding are computed for each node. These two are combined using a combination layer to form the embedding of a node at a given layer. We suggest methods such as using an auto-encoder or building a content graph to generate content embeddings. Finally, by conducting experiments over several real-world datasets, we demonstrate the high accuracy and performance of our models.

Keywords— Graphs (networks), graph neural networks, node embedding (representation), content embedding, structural embedding

1 Introduction
Graphs are an important tool to model data and their relationships. They are used in many domains and applications such as social networks, scientific networks, and protein-protein interaction networks. Several data analysis and machine learning tasks and problems, including classification, regression, clustering and link prediction, have been studied for nodes of a graph or for a collection of graphs. Recent state-of-the-art algorithms for solving these problems rely on computing embeddings (representations) for nodes or graphs. In the embedding (representation) learning task, each node or each graph is
[email protected]
[email protected]
[email protected]
§ [email protected] (corresponding author)

1
mapped to a low-dimensional vector space, so that nodes or graphs that are similar to each other in the graph space obtain similar embeddings in the vector space. Graph neural networks (GNNs) provide a powerful and popular tool to compute such embeddings.
A variety of graph neural networks have been proposed in the literature to
generate high-quality embeddings that can enhance the performance of different
tasks. They mostly rely on the link structure of the graph and iteratively update
the embedding of a node according to its embedding and the embeddings of its
neighbors in the previous iteration (layer). As a result, the local neighborhood
of a node plays a critical role in its computed final embedding. However, nodes of a graph usually contain rich content that can be used to improve several tasks such as node classification and clustering. Existing GNNs mostly use this content only in the form of feature vectors that are fed into them as nodes' first-layer embeddings. However, consecutive filters (convolutions) applied during the various iterations/layers reduce the impact of nodes' feature vectors, so that they have little influence on nodes' final embeddings. As a result, the discriminative power of nodes' content information, which can be useful in many applications, is mostly ignored.
Motivated by this observation and in order to preserve the impact of nodes’
contents at higher GNN layers, in this paper we propose novel methods that
augment nodes’ embeddings with nodes’ content information, at higher GNN
layers. More precisely, we propose a model wherein a structural embedding us-
ing a GNN and a content embedding are computed for each node. These two
are combined using a combination layer to form the final embedding of a node
at a given layer. We propose two methods to generate the content embeddings of nodes. In the first method, we generate them by applying an auto-encoder to nodes' initial feature vectors. In the second method, we construct them by forming a content-similarity graph and applying a GNN to this content graph to obtain nodes' content embeddings. Our content augmentation methods are independent of the underlying GNN model and can be combined with any of them. In this paper, we apply them to three well-known graph neural net-
works: GCN [15], GAT [22] and GATv2 [3]. By conducting experiments over
several real-world datasets, we show that our content augmentation techniques
considerably improve the performance of GNNs.
The rest of this paper is organized as follows. In Section 2, we provide an overview of related work. In Section 3, we present preliminaries and defini-
tions used in the paper. In Section 4, we present our methods for augmenting
graph neural networks with nodes’ content information. We empirically evalu-
ate the performance of our proposed methods in Section 5. Finally, the paper
is concluded in Section 6.

2 Related work
Kipf and Welling [15] introduced the concept of graph convolutional networks (GCNs), with a primary emphasis on node classification. They showed that graph signal processing convolutions essentially aggregate feature information from the nodes neighboring a given node. Their method generates node embeddings based on localized neighborhoods built around individual nodes. In GraphSAGE [10], Hamilton et al. proposed the use of a versatile aggregation function Agg, as follows:

\mathbf{z}_v^{(k)} = \mathrm{sig}\left( \mathbf{W}^{(k)} \cdot \mathrm{Agg}\left(\left\{ \mathbf{z}_u^{(k-1)}, \forall u \in N(v) \right\}\right),\ \mathbf{B}^{(k)} \mathbf{z}_v^{(k-1)} \right),

where \mathbf{W}^{(k)} and \mathbf{B}^{(k)} are trainable weight matrices and sig is the sigmoid function.
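As a rough illustration of this kind of update (not the GraphSAGE authors' implementation), the following NumPy sketch performs one layer of mean aggregation over neighbors; the function and variable names (graphsage_layer, adj, Z, W, B) are our own, and the two transformed terms are simply summed before the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graphsage_layer(adj, Z, W, B):
    """One GraphSAGE-style update: aggregate neighbor embeddings (mean),
    transform them with W, transform the node's own embedding with B,
    and pass the combined message through a sigmoid non-linearity."""
    n = adj.shape[0]
    Z_new = np.zeros((n, W.shape[0]))
    for v in range(n):
        neigh = np.where(adj[v] > 0)[0]
        agg = Z[neigh].mean(axis=0) if len(neigh) else np.zeros(Z.shape[1])
        Z_new[v] = sigmoid(W @ agg + B @ Z[v])
    return Z_new

# toy example: 3 nodes, 4-dimensional features, 2-dimensional output
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
Z = rng.normal(size=(3, 4))
W, B = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
print(graphsage_layer(adj, Z, W, B).shape)  # (3, 2)
```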
Velickovic et al. [22] introduced Graph Attention Networks (GATs), where they considered non-equal importance weights for the neighbors of a node. They suggested using an attention mechanism α. Then the attention coefficient between a pair of nodes v and u is defined as follows:

e_{vu} = \alpha\left( \mathbf{W}^{(k)} \mathbf{z}_u^{(k-1)},\ \mathbf{W}^{(k)} \mathbf{z}_v^{(k-1)} \right),

where \mathbf{W}^{(k)} is a trainable weight matrix. Jin et al. [13] introduced SimP-GCN, a graph convolutional network consisting of the following steps. First, it employs an adaptive technique to fuse structural and node features. Then, it uses a learning approach to predict pairwise feature similarity from the hidden embeddings of designated node pairs.
Alon and Yahav [1] studied the challenge of over-squashing in deep graph neural networks, a problem that intensifies as the number of GNN layers grows. They suggested maintaining the essential information flow from the initial layers to a central bottleneck layer. They also proposed a mechanism that
creates new edges, connecting crucial nodes to the target node. In the GIANT
model [6], the authors addressed feature scarcity in graph-agnostic contexts by
leveraging a pre-trained BERT model to predict neighborhoods of raw textual
data. They extracted feature vectors from the BERT model and seamlessly
integrated them into the GNN model.
Li et al. [17] proposed a general class of structure related features, called
Distance Encoding (DE), that captures the distance between the node set whose
representation is to be learned and each node in the graph. It can apply different
graph-distance measures such as shortest path distance or PageRank scores. You
et al. [26] presented the class of Identity-aware Graph Neural Networks (ID-
GNNs). ID-GNN extends GNNs by inductively considering nodes’ identities
during message passing. It first extracts the ego network centered at the node,
then performs rounds of heterogeneous message passing, where different sets of
parameters are applied to the center node than to other surrounding nodes.
Huang et al. [12] proposed an approach to mitigate the computational com-
plexity of GNN models. In their model, GNN serves as a post-processing layer,
while the initial layer uses the power of a multi-layer perceptron (MLP). Thost
and Chen [21] introduced the DAGNN model to enhance the performance of
multi-layer GNN models for acyclic graphs. DAGNN processes nodes based
on the partial order stipulated by the Directed Acyclic Graph (DAG), utilizing
current-layer information to compute the present-layer representation of the tar-
get node. Chehreghani [5] argued that the explainability of GNN models is on
par with that of decision trees and, consequently, surpasses the explainability
of conventional neural networks. Brody et al. [3] proposed a straightforward
modification to GAT’s attention function, by reordering its internal operations.
They introduced the GATv2 method, which exhibits enhanced expressiveness
compared to GAT.

For an in-depth exploration of GNNs, interested readers can refer to [23].
Another aspect of GNNs is their ability to handle dynamic graphs, as demon-
strated by works such as [14, 4]. These methods address scenarios that require
effective updates to computed embeddings after modifications of the graph, such
as: node insertion, node deletion, edge addition, edge deletion, and weight ad-
justments. The literature also includes shallow approaches [18, 8, 7], where the embeddings of nodes are treated as parameters to be learned. However, these shallow models often exhibit inferior performance compared to GNNs, especially in learning tasks such as node classification.

3 Preliminaries
Throughout the paper, we use the following conventions for notation: lowercase letters for scalars, uppercase letters for sets, multisets and graphs, bold lowercase letters for vectors and bold uppercase letters for matrices. We assume the reader is familiar with basic concepts in graph theory. By G = (V, E), we refer to a graph whose node set is V and edge set is E. By n we denote the number of nodes of the graph. We assume that each node has a feature vector of size d. By \mathbf{X} \in \mathbb{R}^{n \times d}, we denote the matrix of nodes' feature vectors. By finding vector embeddings for nodes of a graph, we want to encode them as low-dimensional vectors that summarize the structure of the graph. A node embedding function g is a mapping that maps each node in the graph to a low-dimensional vector of real values [9]. The most popular methods to generate
vector embeddings are graph neural networks.
At a high level, GNNs consist of the following steps:

i) for each node v, a neighborhood N (v) is constructed,


ii) at the first layer, the embedding of node v consists of its features (or e.g.,
the zero vector, if attributes are absent), and
iii) at each layer k+1, the embedding of node v is computed using a function f that takes as input the layer-k embedding of v and the layer-k embeddings of the nodes in the neighborhood of v. Function f consists of the following components: i) a linear transformation defined by (at least) one trainable weight matrix \mathbf{W} that converts a lower-layer embedding into a message, ii) an activation function σ, applied element-wise to each generated message to induce non-linearity in the model, and iii) an aggregation function agg, which takes a number of vectors as input, aggregates them, and generates a vector as the higher-layer embedding of the node.

As a result, function f is defined as follows:

\mathbf{z}_v^{(k+1)} = f(v, N(v)) = \mathrm{agg}\left(\left\{ \sigma\left(\mathbf{W} \cdot \mathbf{z}_u^{(k)}\right),\ \forall u \in N(v) \cup \{v\} \right\}\right), \qquad (1)

where \mathbf{z}_v^{(k+1)} and \mathbf{z}_u^{(k)} are the vector embeddings of v at layer k+1 and of u at layer k, respectively. A widely used activation function is ReLU, defined as ReLU(x) = max(0, x).
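For concreteness, here is a minimal sketch of the generic layer of Equation 1, assuming a dense adjacency matrix, sum aggregation and a PyTorch framing; the class and variable names are illustrative and not tied to any specific GNN.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One layer of Equation 1: transform each node's embedding with W,
    apply an element-wise non-linearity, and aggregate over N(v) ∪ {v}."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, adj, Z):
        # add self-loops so that v itself is included in the aggregation
        adj_hat = adj + torch.eye(adj.size(0))
        messages = self.act(self.W(Z))   # sigma(W z_u) for every node u
        return adj_hat @ messages        # sum aggregation over N(v) ∪ {v}

# toy usage: 4 nodes with 8-dimensional features mapped to 16 dimensions
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
Z0 = torch.randn(4, 8)
layer = SimpleGNNLayer(8, 16)
print(layer(adj, Z0).shape)  # torch.Size([4, 16])
```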

4 Content augmented graph neural networks
As already mentioned, our primary concern in this paper lies in the vanishing
of essential first-layer information, specifically content information, as it propa-
gates through multi-layer convolutional GNNs. Consequently, the models tend
to over-rely on graph structural information during the final decision-making
process and pay less attention to nodes’ content. In this section, we present
two effective models that are designed to augment GNNs with content information. The first model is suited to the supervised setting, wherein enough labeled examples are accessible during training. This is because of the auto-encoder utilized within this model: although its encoder/decoder components are inherently unsupervised, in our model they must be trained with a sufficient number of examples to learn dimension-reduction patterns effectively. The second model works well in the semi-supervised setting, wherein only a limited amount of labeled data is accessible during training. It constructs a content graph alongside the input graph and applies GNN models to both of them. GNNs are known for their superior performance in the semi-supervised setting. As a result, the second model demonstrates outstanding performance in this setting.
In the rest of this section, first, we briefly discuss how the input graph
is preprocessed, before feeding it into GNN models. Then we describe our
supervised and semi-supervised content augmentation methods.

4.1 Preprocessing

In our data preprocessing phase, we adopt a bag-of-words approach [16] for con-
tent processing, creating the initial vectors that will be employed by the GNN
layers. The bag-of-words technique represents the content information of the
input data, serving as a foundation for subsequent graph-based operations. It
should be noted that due to variations in the datasets, this approach results
in different first-layer vector sizes. We will elaborate on this in Section 5, pro-
viding insights into the impact of diverse dataset characteristics on the GNNs’
performance.
Furthermore, to enhance information flow and preserve local content within
the graph, it is a common technique to augment the graph structure by adding
self-loops to each node. Our experiments show that the addition of self-loops
improves the performance in cases where the dataset structure is denser than
usual. For example, self-loops boost both the base model and our proposed
model in the DBLP and BlogCatalog datasets. However, in the case of Cora and CiteSeer, we observe a decrease in performance. We therefore treat self-loop addition as a hyperparameter and report the best results obtained with or without it. We also apply a dropout layer to the GNNs to prevent overfitting.
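The following sketch illustrates this preprocessing step under our description, assuming scikit-learn's CountVectorizer for the bag-of-words features and a boolean hyperparameter for self-loop addition; the function name and arguments are our own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(docs, adj, add_self_loops=True):
    """Build bag-of-words feature vectors for the nodes' textual content and
    optionally add a self-loop to every node of the adjacency matrix."""
    # bag-of-words: one row per node, one column per vocabulary term
    X = CountVectorizer().fit_transform(docs).toarray().astype(np.float32)
    A = adj.copy().astype(np.float32)
    if add_self_loops:   # treated as a hyperparameter in our experiments
        np.fill_diagonal(A, 1.0)
    return X, A

docs = ["graph neural networks", "node classification with graphs", "content features"]
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
X, A = preprocess(docs, adj, add_self_loops=True)
print(X.shape, A.diagonal())   # the feature dimension depends on the vocabulary
```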

4.2 Supervised content augmentation of graph neural networks (AugS)
Figure 1 presents the high-level structure of our proposed model for content
augmentation of graph neural networks, in the supervised setting. This model

is compatible with any graph neural network and does not depend on the specific
model used. As mentioned earlier, this model is primarily effective in supervised
scenarios, as deep auto-encoders tend to have limited functionality in cases of
data scarcity. We refer to this supervised content augmentation model as AugS-GNN. Our implementation of AugS is publicly available at https://github.com/amkkashani/AugS-GNN. In the following, we describe each component of the model in detail.

Figure 1: The high-level architecture of AugS-GNN. Red units are computational and blue units are data.

4.2.1 Structural and content embeddings

As mentioned earlier, we create two distinct embeddings for each node. First,
we utilize the GNN model, which incorporates both the graph structure and
features generated from the bag-of-words approach. In this way, a structural
embedding is built for each node. Then, in order to inject a stronger dimension of content information at higher GNN layers, we generate a content embedding for each node. This is done by feeding the first-layer embedding (the
initial feature vector) of each node into an auto-encoder. The output of the
encoder component of the auto-encoder serves as the content embedding.
For the auto-encoder, we use a model based on multiple MLP encoder layers and multiple MLP decoder layers, introduced by Hinton and Salakhutdinov [11]. In the encoder part, each layer's size is reduced by half compared to the previous layer, while the decoder part follows the opposite pattern, with each layer's size being doubled relative to the preceding one. The unsupervised loss function used during training is defined by Equation 2, which trains the auto-encoder's parameters:

J(\Theta) = \sum_{i=1}^{n} \left\| f_{\mathrm{dec}} \circ f_{\mathrm{enc}}\left(\mathbf{x}^{(i)}\right) - \mathbf{x}^{(i)} \right\|_2^2, \qquad (2)

where \Theta is the set of trainable parameters of the encoder and decoder, operator \circ is function composition, f_{\mathrm{enc}} and f_{\mathrm{dec}} respectively denote the encoder and decoder functions (each consisting of several MLP layers), and \mathbf{x}^{(i)} represents the initial feature vector of node i. The loss function computes the squared L2 norm of the difference between the decoder's composition of the encoder's output and the original input. Using it, we train the auto-encoder parameters so that each vector, after encoding and decoding, becomes as close as possible to its original form. The summation runs over the feature vectors of all n nodes of the training dataset. After completing the training phase, we use the output of the last encoder layer (the input of the first decoder layer) as the content embedding. This layer is called the bottleneck layer. It is worth highlighting that the parameters of the auto-encoder are not learned jointly with the parameters of the whole model, as a separate loss function is used to train the auto-encoder. We set the input dimension to the number of node features and the bottleneck dimension to 64.
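A minimal PyTorch sketch of such an auto-encoder, assuming halving encoder layers, mirrored doubling decoder layers, a 64-dimensional bottleneck and the reconstruction loss of Equation 2 (averaged over the batch here); it is an illustration of the idea rather than the exact implementation.

```python
import torch
import torch.nn as nn

def make_autoencoder(in_dim, bottleneck=64):
    """Encoder halves the layer size until the bottleneck; decoder mirrors it."""
    sizes = [in_dim]
    while sizes[-1] // 2 > bottleneck:
        sizes.append(sizes[-1] // 2)
    sizes.append(bottleneck)
    enc_layers, dec_layers = [], []
    for a, b in zip(sizes[:-1], sizes[1:]):
        enc_layers += [nn.Linear(a, b), nn.ReLU()]
    for a, b in zip(sizes[::-1][:-1], sizes[::-1][1:]):
        dec_layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*enc_layers), nn.Sequential(*dec_layers)

# train with the squared-L2 reconstruction loss of Equation 2 (batch-averaged)
X = torch.randn(500, 1433)                      # e.g. Cora-sized feature vectors
encoder, decoder = make_autoencoder(X.size(1))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(10):                             # a few illustrative epochs
    opt.zero_grad()
    loss = ((decoder(encoder(X)) - X) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()
content_emb = encoder(X).detach()               # bottleneck output = content embedding
print(content_emb.shape)                        # torch.Size([500, 64])
```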
4.2.2 Combination layer

In this layer, we combine the structural and content embeddings obtained for
each node to form a unique embedding for it. Our combination layer consists of two phases: the fusion phase, wherein the structural and content embeddings are fused to form a single vector; and the dimension reduction phase, wherein the dimensionality of the vector obtained from the first phase is reduced. For the first phase, we can explore several fusion functions, including: concatenation, where the two vectors are concatenated; sum, where an element-wise sum is applied to the vectors; and max, where an element-wise max is applied to the vectors. Our experiments demonstrate that concatenation consistently outperforms the other fusion methods across all datasets. Therefore, in this paper, we specifically report the results achieved using the concatenation function.
For the second phase, we incorporate an MLP, whose parameters are trained
jointly with the other parameters of the model (unlike the parameters of the
auto-encoder used to generate the content embedding).
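A sketch of the combination layer under these choices (concatenation followed by an MLP that reduces the dimension); the class name and the dimensions used in the example are assumptions.

```python
import torch
import torch.nn as nn

class CombinationLayer(nn.Module):
    """Fuse the structural and content embeddings of each node by concatenation,
    then reduce the dimensionality with an MLP trained jointly with the model."""

    def __init__(self, struct_dim, content_dim, out_dim):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Linear(struct_dim + content_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, h_struct, h_content):
        fused = torch.cat([h_struct, h_content], dim=-1)   # fusion phase
        return self.reduce(fused)                          # dimension-reduction phase

h_struct, h_content = torch.randn(5, 128), torch.randn(5, 64)
combiner = CombinationLayer(128, 64, 64)
print(combiner(h_struct, h_content).shape)  # torch.Size([5, 64])
```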

4.2.3 Prediction head

The embeddings generated by AugS can be used as the input for several down-
stream tasks and problems such as classification, regression, and sequence la-
beling. The method used to address these tasks is called the prediction head.
Various machine learning techniques, such as SVMs, decision trees, and linear regression, can be used in the prediction process. In the experiments of this paper, our focus is to assess the model's performance in classifying nodes. To do so, we use an MLP, consisting of two dense hidden layers each with 16 units, as the prediction head. The parameters of the prediction head are jointly
trained with the other parameters of the whole model.
We train our model using the cross entropy loss function [20]. For a single training example, it is defined as follows:

-\sum_{k=1}^{c} T_k \log(S_k), \qquad (3)

where c is the number of classes, and T_k and S_k respectively represent the true probability and the estimated probability that the example belongs to class k.
The total cross entropy is defined as the sum of cross entropies of all training
examples.
We use the following setting to train the model. We set the number of epochs
to 200, batch size to 32, dropout ratio to 0.05 (in the MLP of the prediction
head, we set it to 0.2), and the number of GNN layers to 2. We use the Adam
algorithm as the optimizer and ReLU as the activation function in the hidden
layers and softmax in the output layer.
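To make the setup concrete, here is a sketch of the prediction head and loss described above (two dense hidden layers with 16 units, softmax output via the cross-entropy loss, Adam optimizer); the helper name and the toy batch are our own.

```python
import torch
import torch.nn as nn

def make_prediction_head(emb_dim, num_classes, dropout=0.2):
    """MLP prediction head: two dense hidden layers with 16 units each,
    followed by an output layer over the classes."""
    return nn.Sequential(
        nn.Linear(emb_dim, 16), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(16, 16), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(16, num_classes),   # logits; softmax is applied inside the loss
    )

head = make_prediction_head(emb_dim=64, num_classes=7)
criterion = nn.CrossEntropyLoss()     # cross entropy of Equation 3, softmax built in
optimizer = torch.optim.Adam(head.parameters())

embeddings, labels = torch.randn(32, 64), torch.randint(0, 7, (32,))  # one batch of 32
loss = criterion(head(embeddings), labels)
loss.backward()
optimizer.step()
print(float(loss))
```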

4.3 Semi-supervised content augmentation of graph neural networks (AugSS)
In this section, we present our second augmentation method, specifically tailored
to work in a semi-supervised setting. This approach involves the construction

of an auxiliary graph based on nodes’ content attributes, which is subsequently
integrated with the input graph during GNN processing. By adopting this
strategy, we infuse the GNN with the influence of nodes’ contents, effectively
mitigating the issue of content information degradation within the GNNs. We
refer to this semi-supervised content augmentation model as AugSS-GNN. Sim-
ilar to AugS, this approach is compatible with any graph neural network and
does not rely on the particular model employed. Our implementation of AugSS is publicly available at https://github.com/FatemehGholamzadeh/AugSS-GNN.

4.3.1 Content graph construction

Our second approach for content augmentation in GNNs involves the creation of a "content graph", which is fed into the GNN framework. Henceforth, in this paper, we refer to the input graph as the "structural graph", to distinguish it from the content graph. The content graph is built from the inherent features of the graph nodes, in the following two steps (a code sketch follows the list):

1. We compute pairwise similarities between initial feature vectors of nodes. We can employ diverse metrics suitable for assessing the likeness of two vectors, such as Euclidean distance, dot product, or cosine similarity. In this paper, we specifically employ cosine similarity.
2. If the similarity value of the feature vectors of two nodes exceeds a certain
threshold ϵ, we create an edge between the two nodes. It is important to
note that the threshold value varies across different datasets. In this study,
we employ a grid search methodology to determine the optimal threshold
for each dataset.
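A minimal NumPy sketch of this construction, using cosine similarity and a per-dataset threshold eps (found by grid search in our experiments); the function and variable names are illustrative.

```python
import numpy as np

def build_content_graph(X, eps):
    """Build the content graph: connect two nodes whenever the cosine similarity
    of their initial feature vectors exceeds the threshold eps."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)          # avoid division by zero
    sim = Xn @ Xn.T                               # pairwise cosine similarities
    A_content = (sim > eps).astype(np.float32)
    np.fill_diagonal(A_content, 0.0)              # ignore trivial self-similarity
    return A_content

X = np.random.rand(6, 20)                         # 6 nodes, 20-dimensional features
print(build_content_graph(X, eps=0.8).sum())      # number of content-graph edges
```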

4.3.2 Model’s architecture

Our semi-supervised strategy for integrating the content graph into the training
procedure of graph neural networks is illustrated in Figure 2. Apart from the
structural graph, we introduce the content graph as an input to the graph neural
network. GNNs take a feature matrix of graph nodes as input. Therefore, both
the structural and content graphs require an initial feature matrix to be provided
as input to the GNN. In our approach, we employ the same feature matrix X as
the initial feature matrix, for both the structural graph and the content graph.
We examined alternative techniques, such as vectors generated by the DeepWalk
algorithm [19], or initializing the feature matrix for the content graph with a set
of structural graph features. However, these methods did not yield satisfactory
results. Let us denote the structural graph by G and the content graph by G′. All
graph neural networks used in this section have two graph convolution layers.
First, a graph convolution layer, called Conv1, is applied to each of the graphs,
considering the adjacency matrix specific to that graph. It is important to note
that each of these graph convolutions has its own weight parameters, and they
do not share trainable weight parameters.
Suppose that Conv1 maps input vectors with dimension d to output vectors
with dimension d′ . After applying this layer to G and G′ , we will have two
output matrices with the same size n × d′ . In fact, for each node we will have
two vectors of dimension d′. At this stage, we combine these two vectors using an aggregation method, which could be, for example, element-wise averaging or summing, or their concatenation (which yields a vector of dimension 2d′). Then, we reduce the dimensionality of the combined vector to obtain a vector of dimension d′, which is used as input to the next layer.
In our experiments, we also employed an alternative aggregation method that led to improved results. We introduced two scalar trainable parameters, w_1 and w_2, serving as the weights for the structural graph and content graph outputs, respectively. As a result, if we denote the structural graph output by \mathbf{h}_1 and the content graph output by \mathbf{h}_2, the combined output can be computed as follows:

\mathbf{h} = w_1 \mathbf{h}_1 + w_2 \mathbf{h}_2. \qquad (4)
We train these two weights throughout the network’s training process, along
with the other parameters. This dynamic training approach allows us to adap-
tively determine the significance of each embedding component.
The second layer of the graph neural network (Conv2), which is also the final
layer, takes input vectors of dimension d′ and maps them to vectors of dimension
c, where c is the number of classes in node classification tasks. Therefore, by
applying the softmax function, the class of each node is determined. Similar to
the AugS method, we use cross entropy as the loss function. Figure 2 illustrates
our proposed model’s architecture.
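A simplified sketch of this dual-graph architecture, assuming dense adjacency matrices, simple graph-convolution layers of the form A·Z·W, and the weighted combination of Equation 4; applying the second convolution over the structural graph is our assumption for the example, and all names are illustrative.

```python
import torch
import torch.nn as nn

class AugSSSketch(nn.Module):
    """Two-layer GNN over a structural graph and a content graph.
    Conv1 is applied separately to each graph (separate weights); the two
    outputs are combined with trainable scalars w1, w2 (Equation 4), and
    Conv2 maps the combined embedding to class scores."""

    def __init__(self, d, d_hidden, c):
        super().__init__()
        self.conv1_struct = nn.Linear(d, d_hidden, bias=False)
        self.conv1_content = nn.Linear(d, d_hidden, bias=False)
        self.conv2 = nn.Linear(d_hidden, c, bias=False)
        self.w1 = nn.Parameter(torch.tensor(1.0))
        self.w2 = nn.Parameter(torch.tensor(1.0))

    def forward(self, A_struct, A_content, X):
        h1 = torch.relu(A_struct @ self.conv1_struct(X))    # structural-graph output
        h2 = torch.relu(A_content @ self.conv1_content(X))  # content-graph output
        h = self.w1 * h1 + self.w2 * h2                     # Equation 4
        logits = A_struct @ self.conv2(h)                    # second convolution layer
        return torch.log_softmax(logits, dim=-1)

n, d, c = 6, 20, 3
A_struct, A_content = torch.eye(n), torch.eye(n)             # toy adjacency matrices
X = torch.randn(n, d)
model = AugSSSketch(d, d_hidden=16, c=c)
print(model(A_struct, A_content, X).shape)   # torch.Size([6, 3])
```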

Figure 2: The high-level architecture of AugSS-GNN.

5 Experiments
In this section, by conducting experiments over several real-world datasets, we show that our content augmentation techniques considerably improve the performance of GNNs. We choose three well-known GNN models to improve with our augmentation techniques: GCN [15], GAT [22] and GATv2 [3]. The experiments are conducted on a Colab T4 GPU with 16 GB of VRAM available for computation.

Table 1: Summary of our real-world datasets.
Dataset #nodes #edges #attributes #classes
CiteSeer 3312 4732 3703 6
Cora 2708 5278 1433 7
Wiki 2405 17981 4973 17
DBLP 17716 105734 1639 4
BlogCatalog 5196 343486 8189 6

5.1 Datasets
To evaluate our models, we utilize five widely used datasets: Cora [25], Cite-
Seer [25], DBLP [2], BlogCatalog [24] and Wiki [24]. The specifications of these
real-world datasets are summarized in Table 1. We select these datasets for
evaluation due to their inclusion of both graph-based and textual content. We
partition each dataset into training, validation, and test sets.

5.2 Evaluation measures


We adopt accuracy and macro-F1 as the evaluation criteria. Accuracy mea-
sures the proportion of correctly classified instances. Macro-F1 accounts for
both precision and recall, considering class imbalances and providing a balanced
assessment of the model’s performance. They are formally defined as follows.
\mathrm{accuracy} = \frac{\text{number of correctly classified test examples}}{\text{total number of test examples}} \times 100. \qquad (5)

\mathrm{precision}_{\mathrm{class}(k)} = \frac{TP_{\mathrm{class}(k)}}{TP_{\mathrm{class}(k)} + FP_{\mathrm{class}(k)}}, \qquad (6)

where TP_{\mathrm{class}(k)} represents the number of correctly predicted examples in class k and FP_{\mathrm{class}(k)} indicates the number of examples predicted as class k but belonging to other classes.

\mathrm{recall}_{\mathrm{class}(k)} = \frac{TP_{\mathrm{class}(k)}}{TP_{\mathrm{class}(k)} + FN_{\mathrm{class}(k)}}, \qquad (7)

where FN_{\mathrm{class}(k)} is defined as the number of examples in class k that are predicted as belonging to other classes.

F1_{\mathrm{class}(k)} = \frac{2 \times \mathrm{precision}_{\mathrm{class}(k)} \times \mathrm{recall}_{\mathrm{class}(k)}}{\mathrm{precision}_{\mathrm{class}(k)} + \mathrm{recall}_{\mathrm{class}(k)}}, \qquad (8)

and

\text{macro-}F1 = \frac{1}{c} \times \sum_{k=1}^{c} F1_{\mathrm{class}(k)}, \qquad (9)

where c is the number of classes.
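A small sketch computing these measures (Equations 5-9) directly from predicted and true labels; equivalent results can be obtained with scikit-learn's accuracy_score and f1_score with average='macro'.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return 100.0 * np.mean(y_true == y_pred)                  # Equation 5

def macro_f1(y_true, y_pred, num_classes):
    """Per-class precision, recall and F1 (Equations 6-8), averaged over classes."""
    f1s = []
    for k in range(num_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0)
    return np.mean(f1s)                                        # Equation 9

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred, 3))
```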

5.3 Results
In this section, we present the results of our experiments, in both supervised
and semi-supervised settings.

Table 2: Comparing accuracy scores of the basic GNN models against their
augmented counterparts, using the AugS method.
CORA CITESEER WIKI DBLP BLOGCAT
GCN 86.92±0.26 73.32±0.08 74.62±0.15 80.21±0.02 71.88±2.97
AugS-GCN 87.41±0.55 72.44±0.38 75.58±0.94 83.09±0.13 86.04±1.33
GAT 85.11±0.00 75.83±0.13 73.47±0.74 72.83±0.019 59.05±0.41
AugS-GAT 87.65±0.46 74.76±0.52 73.64±0.60 83.00±0.19 72.52±0.96
GATv2 87.10±0.11 75.90±0.14 72.96±0.58 67.08±0.62 62.29±0.33
AugS-GATv2 87.52±0.37 74.43±0.73 73.40±1.10 83.69±0.25 70.94±0.81

5.3.1 Supervised content augmentation

In the supervised setting, we split each dataset in half for training and test-
ing. We report accuracy and macro-F1 scores for both the GNN models and
the AugS-GNN models. We run each model 10 times and report the average results as well as the standard deviations. The accuracy results reported in Table 2 show improvements in most cases. In a number of cases (DBLP and BLOGCAT), the improvements of AugS are significant. Additionally, the macro-F1 scores reported in Table 3 demonstrate noticeable improvements of AugS, indicating the effectiveness of our suggested model on imbalanced datasets that have classes with few instances.

Table 3: Comparing macro F1-scores of the basic GNN models against their
augmented counterparts, using the AugS method.
CORA CITESEER WIKI DBLP BLOGCAT
GCN 84.87±0.12 69.33±0.07 64.00±0.53 64.60±0.06 69.62±4.68
AugS-GCN 86.48±0.50 87.10±0.11 59.89±3.01 78.73±0.22 85.637±1.34
GAT 82.47±0.03 71.59±0.18 52.03±2.20 50.20±0.054 58.19±0.40
AugS-GAT 86.22±0.65 71.48±0.79 58.79±2.14 79.08±0.36 71.88±0.95
GATv2 84.87±0.12 71.50±0.17 51.47±2.30 39.47±0.39 61.615±0.37
AugS-GATv2 85.80±0.62 71.21±0.11 58.77±2.21 80.12±0.39 70.57±0.78

5.3.2 Semi-supervised content augmentation

In this section, we assess the performance of AugSS-GNN within a semi-supervised framework. To provide a limited amount of training data for the models, we select 20 labeled samples per class from our datasets and randomly sample 500 data points for validation. Additionally, we allocate 1000 data points for evaluating the models. In the case of the Wiki dataset, some classes have fewer than 20 samples, prompting us to select just 5 training samples from these classes.
We present accuracy (Table 4) and macro-F1 scores (Table 5) for both the base
models (GCN, GAT, and GATv2) and their AugSS models. These reported
scores are based on the results of 10 runs of the models, where the average ac-
curacy and macro-F1 scores as well as the standard deviations are provided in
the tables.
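A sketch of this labeled/validation/test selection, assuming integer class labels and per-class sampling (20 per class, or 5 for classes that are too small, as for Wiki); the function name and the seed handling are our own.

```python
import numpy as np

def semi_supervised_split(labels, per_class=20, num_val=500, num_test=1000, seed=0):
    """Pick per_class labeled training nodes from every class (5 if the class is
    smaller than per_class), then sample validation and test nodes from the rest."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for k in np.unique(labels):
        members = np.where(labels == k)[0]
        take = per_class if len(members) >= per_class else min(5, len(members))
        train_idx.extend(rng.choice(members, size=take, replace=False))
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train_idx))
    return np.array(train_idx), rest[:num_val], rest[num_val:num_val + num_test]

labels = np.random.randint(0, 7, size=2708)        # e.g. Cora-sized toy labels
train_idx, val_idx, test_idx = semi_supervised_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 140 500 1000
```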
Our experiments illustrate the impact of incorporating a content graph into
the training process of graph neural networks. By using AugSS, across various

datasets, we observe noticeable improvements in accuracy and F1-score, ranging
from approximately 2% to 14%. Therefore, AugSS considerably enhances the
performance of GNNs, when a limited amount of labeled data is available during
the training phase. This improvement in the performance of GNN models is due to the role of nodes' content information in discriminating between them, which is better modeled and reflected in the final classifier by our content augmentation method.

Table 4: Comparing accuracy scores of the basic GNN models against their
augmented counterparts, using the AugSS method.
CORA CITESEER WIKI DBLP BLOGCAT
GCN 80.81 ± 0.60 69.32 ± 0.69 68.65 ± 1.26 74.92 ± 1.03 75.38 ± 0.83
AugSS-GCN 82.48 ± 0.93 72.859 ± 0.55 72.61 ± 0.84 77.31 ± 0.65 72.87 ± 0.92
GAT 81.17 ± 1.21 68.14 ± 1.51 52.89 ± 3.92 75.4 ± 0.96 63.54 ± 2.59
AugSS-GAT 82.14 ± 0.71 72.463 ± 0.53 59.5 ± 0.74 76.33 ± 0.71 67.6 ± 2.83
GATv2 80.64 ± 0.62 68.2 ± 1.11 50.61 ± 3.31 74.85 ± 1.36 67.57 ± 2.10
AugSS-GATv2 81.62 ± 0.42 72.50 ± 0.69 59.7 ± 1.55 76.42 ± 0.82 81.28 ± 1.33

Table 5: Comparing macro F1-scores of the basic GNN models against their
augmented counterparts, using the AugSS method.
CORA CITESEER WIKI DBLP BLOGCAT
GCN 79.917 ± 0.6 65.627 ± 0.9 58.775 ± 1.6 70.943 ± 0.9 74.702 ± 0.9
AugSS-GCN 81.355 ± 0.9 69.119 ± 0.6 62.781 ± 1.1 74.03 ± 0.8 71.994 ± 1.0
GAT 80.362 ± 1.0 68.459 ± 0.0 43.291 ± 2.9 72.109 ± 0.8 62.991 ± 2.2
AugSS-GAT 81.035 ± 0.9 70.457 ± 2.1 48.855 ± 1.5 72.813 ± 0.8 66.79 ± 2.6
GATv2 79.745 ± 0.6 64.628 ± 0.9 41.602 ± 2.1 71.879 ± 1.3 66.414 ± 2.3
AugSS-GATv2 80.659 ± 0.6 67.785 ± 0.5 47.623 ± 1.7 72.948 ± 1.2 80.742 ± 1.4

6 Conclusion
In this paper, we proposed novel methods that augment nodes’ embeddings with
their content information at higher GNN layers. In our methods, for each node
a structural embedding and a content embedding are computed and combined
using a combination layer, to form the embedding of the node at a given layer.
We presented techniques such as using an auto-encoder or building a content
graph to generate nodes’ content embeddings. By conducting experiments over
several real-world datasets, we showed that our models considerably improve
the accuracy of GNN models.

References
[1] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and
its practical implications. arXiv preprint arXiv:2006.05205, 2020.
[2] Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embed-
ding of graphs: Unsupervised inductive learning via ranking. arXiv preprint
arXiv:1707.03815, 2017.

[3] Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph atten-
tion networks? In International Conference on Learning Representations,
2022.
[4] Mostafa Haghir Chehreghani. Dynamical algorithms for data mining and
machine learning over dynamic graphs. Wiley Interdiscip. Rev. Data Min.
Knowl. Discov., 11(2), 2021.
[5] Mostafa Haghir Chehreghani. Half a decade of graph convolutional net-
works. Nature Machine Intelligence, 4(3):192–193, 2022.
[6] Eli Chien, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Jiong Zhang,
Olgica Milenkovic, and Inderjit S Dhillon. Node feature extraction
by self-supervised multi-scale neighborhood prediction. arXiv preprint
arXiv:2111.00064, 2021.
[7] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec:
Scalable representation learning for heterogeneous networks. In Proceed-
ings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017,
pages 135–144. ACM, 2017.
[8] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning
for networks. In Balaji Krishnapuram, Mohak Shah, Alexander J. Smola,
Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi, editors, Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages
855–864. ACM, 2016.
[9] Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for networks. In Proceedings of KDD 2016, 2016.
[10] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive represen-
tation learning on large graphs. In Isabelle Guyon, Ulrike von Luxburg,
Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan,
and Roman Garnett, editors, Advances in Neural Information Processing
Systems 30: Annual Conference on Neural Information Processing Systems
2017, 4-9 December 2017, Long Beach, CA, USA, pages 1024–1034, 2017.
[11] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimension-
ality of data with neural networks. science, 313(5786):504–507, 2006.
[12] Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson.
Combining label propagation and simple models out-performs graph neural
networks. arXiv preprint arXiv:2010.13993, 2020.
[13] Wei Jin, Tyler Derr, Yiqi Wang, Yao Ma, Zitao Liu, and Jiliang Tang.
Node similarity preserving graph convolutional networks. In 14th ACM
International Conference on Web Search and Data Mining (WSDM) 2021,
Jerusalem, Israel, March 8-12, 2021. ACM, 2021.
[14] Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay
Sethi, Peter Forsyth, and Pascal Poupart. Representation learning for dy-
namic graphs: A survey. J. Mach. Learn. Res., 21:70:1–70:73, 2020.

[15] Thomas N. Kipf and Max Welling. Semi-supervised classification with
graph convolutional networks. In 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Confer-
ence Track Proceedings. OpenReview.net, 2017.
[16] Quoc Le and Tomas Mikolov. Distributed representations of sentences and
documents. In International conference on machine learning, pages 1188–
1196. PMLR, 2014.
[17] Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance en-
coding: Design provably more powerful neural networks for graph represen-
tation learning. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell,
Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural
Information Processing Systems 33: Annual Conference on Neural Infor-
mation Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, vir-
tual, 2020.
[18] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: online learn-
ing of social representations. In Sofus A. Macskassy, Claudia Perlich, Jure
Leskovec, Wei Wang, and Rayid Ghani, editors, The 20th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD
’14, New York, NY, USA - August 24 - 27, 2014, pages 701–710. ACM,
2014.
[19] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learn-
ing of social representations. ACM KDD, 2014.
[20] Claude Elwood Shannon. A mathematical theory of communication. The
Bell system technical journal, 27(3):379–423, 1948.
[21] Veronika Thost and Jie Chen. Directed acyclic graph neural networks.
arXiv preprint arXiv:2101.07965, 2021.
[22] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero,
Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th Inter-
national Conference on Learning Representations, ICLR 2018, Vancouver,
BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open-
Review.net, 2018.
[23] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang,
and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE
Trans. Neural Networks Learn. Syst., 32(1):4–24, 2021.
[24] Renchi Yang, Jieming Shi, Xiaokui Xiao, Yin Yang, Juncheng Liu, and
Sourav S. Bhowmick. Scaling attributed network embedding to massive
graphs. CoRR, abs/2009.00826, 2020.
[25] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-
supervised learning with graph embeddings. In International conference on
machine learning, pages 40–48. PMLR, 2016.
[26] Jiaxuan You, Jonathan M. Gomes-Selman, Rex Ying, and Jure Leskovec.
Identity-aware graph neural networks. In Thirty-Fifth AAAI Conference on
Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative

Applications of Artificial Intelligence, IAAI 2021, The Eleventh Sympo-
sium on Educational Advances in Artificial Intelligence, EAAI 2021, Vir-
tual Event, February 2-9, 2021, pages 10737–10745. AAAI Press, 2021.
