Hierarchical graph-based text classification framework with contextual node embedding and BERT-based dynamic fusion

Aytuğ Onan

Journal of King Saud University – Computer and Information Sciences 35 (2023) 101610
https://doi.org/10.1016/j.jksuci.2023.101610
Article history: Received 6 May 2023; Revised 26 May 2023; Accepted 3 June 2023; Available online 13 June 2023.

Keywords: Text classification; Hierarchical graph; Pre-trained language models; Contextual embedding; Attention mechanism.

Abstract: We propose a novel hierarchical graph-based text classification framework that leverages the power of contextual node embedding and BERT-based dynamic fusion to capture the complex relationships between the nodes in the hierarchical graph and generate a more accurate classification of text. The framework consists of seven stages: Linguistic Feature Extraction, Hierarchical Node Construction with Domain-Specific Knowledge, Contextual Node Embedding, Multi-Level Graph Learning, Dynamic Text Sequential Feature Interaction, Attention-Based Graph Learning, and Dynamic Fusion with BERT. The first stage, Linguistic Feature Extraction, extracts the linguistic features of the text, including part-of-speech tags, dependency parsing, and named entities. The second stage constructs a hierarchical graph based on the domain-specific knowledge, which is used to capture the relationships between nodes in the graph. The third stage, Contextual Node Embedding, generates a vector representation for each node in the hierarchical graph, which captures its local context information, linguistic features, and domain-specific knowledge. The fourth stage, Multi-Level Graph Learning, uses a graph convolutional neural network to learn the hierarchical structure of the graph and extract the features of the nodes in the graph. The fifth stage, Dynamic Text Sequential Feature Interaction, captures the sequential information of the text and generates dynamic features for each node. The sixth stage, Attention-Based Graph Learning, uses an attention mechanism to capture the important features of the nodes in the graph. Finally, the seventh stage, Dynamic Fusion with BERT, combines the output from the previous stages with the output from a pre-trained BERT model to obtain the final integrated vector representation of the text. This approach leverages the strengths of both the proposed framework and BERT, allowing for better performance on the classification task. The proposed framework was evaluated on several benchmark datasets and compared to state-of-the-art methods, demonstrating significant improvements in classification accuracy.

© 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
(Wu et al., 2020). However, the development of deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has enabled the automatic learning of features directly from raw text data, which has greatly improved the accuracy of text classification (Otter et al., 2020). Despite the success of deep learning approaches in recent years, text classification still faces several challenges, including the ability to effectively capture the complex relationships between words and concepts in a given text, and the difficulty in incorporating domain-specific knowledge into the classification process (Malekzadeh et al., 2021). To address these challenges, researchers have proposed various techniques such as hierarchical classification, graph-based models, and contextualized embeddings (Wang et al., 2023).

Graph Neural Networks (GNNs) are a type of deep learning model that have gained popularity in recent years for their ability to effectively model graph-structured data, such as social networks, molecular structures, and text data represented as hierarchical graphs (Zhou et al., 2020). In the context of text classification, GNNs have been shown to outperform traditional sequential learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) by leveraging the complex relationships between the nodes in a hierarchical graph representation of the text (Wu et al., 2023). The key difference between GNNs and sequential learning models is that GNNs operate directly on the graph structure, while sequential models process the input text in a sequential manner. GNNs can capture the structural information of the graph and leverage it to better represent the text data, while sequential models may struggle to model complex relationships between nodes in the graph (Liu and Wu, 2022). GNNs achieve this by propagating information from neighboring nodes to update the node representations iteratively. This allows the model to capture the dependencies and interactions between nodes in the graph and improve the overall performance of text classification tasks (Wu et al., 2021). GNNs provide a powerful framework for modeling graph-structured data such as hierarchical graph representations of text data. By leveraging the relationships between nodes in the graph, GNNs can capture the complex interactions between different elements of the text and provide superior performance compared to traditional sequential learning models (Vashishth et al., 2020).

In recent years, pre-trained language models have made significant contributions to improving the performance of text classification tasks (Qiu et al., 2020). These models are trained on large amounts of unlabeled text data to learn general language representations, and can then be fine-tuned on a specific task, such as text classification. One of the most widely used pre-trained models is the Bidirectional Encoder Representations from Transformers (BERT), which has achieved state-of-the-art results on a variety of natural language processing tasks (Devlin et al., 2018). Other notable pre-trained models include the Generative Pre-trained Transformer 2 (GPT-2) (Radford et al., 2019), the Transformer-XL (Dai et al., 2019), and the Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018). Pre-trained models offer several advantages for text classification, including the ability to leverage large amounts of data to learn powerful language representations, the ability to transfer knowledge from one task to another, and the ability to reduce the amount of labeled data needed for training. These advantages have made pre-trained models a popular choice for text classification tasks, particularly for tasks with limited labeled data (Li et al., 2021; Qiu et al., 2020).

Text classification using graph structures can be divided into two phases: constructing the graph structure and learning from it (Yao et al., 2019). In the construction phase, a graph is built based on the document collection using various techniques such as co-occurrence, dependency parsing, or knowledge graphs. In the learning phase, the graph is used to extract features and classify the documents (Yao et al., 2019; Wang et al., 2023). Graph-based approaches offer several advantages over traditional methods, such as the ability to capture semantic relationships between words and exploit the structural properties of the data (Ragesh et al., 2021). However, there are still several obstacles to improving the effectiveness of graph-based approaches for text classification. One major challenge is the difficulty of capturing long-range interactions between words, which can be critical for accurately classifying documents. Another challenge is the sparsity of the graph, which can lead to poor performance when using standard graph-based algorithms (Piao et al., 2022).

Graph-based approaches for text classification typically construct text graphs by representing the text as a set of nodes connected by edges that capture various relationships between the nodes. However, constructing these graphs is not a trivial task, and current approaches often rely on heuristics and domain-specific knowledge to determine the appropriate structure of the graph. This can limit the effectiveness of the resulting classification model, as important relationships between words and concepts may not be captured by the graph (Koncel-Kedziorski et al., 2019). Another challenge in graph-based text classification is how to effectively combine different sources of contextual information, such as syntactic and semantic features, to improve classification performance. Researchers have attempted to address this challenge by using deep learning models that can learn to automatically extract and combine these features in a data-driven manner. For example, recent studies have explored the use of graph convolutional networks (GCNs), which can capture both the local and global structures of the text graph to better incorporate different sources of contextual information. Additionally, pre-trained language models such as BERT have been used to encode contextual information for each node in the graph, allowing for better representation learning and classification performance (Wang et al., 2023). Despite these advancements, there are still challenges to be addressed in graph-based text classification, such as dealing with noisy and incomplete data, and designing effective graph structures that capture relevant semantic relationships between words and concepts. The proposed framework for hierarchical graph-based text classification with contextual node embedding and BERT-based dynamic fusion is a novel approach that overcomes key limitations of existing models. The main contributions of the study include:

The integration of linguistic features, domain-specific knowledge, and contextual node embeddings into a hierarchical graph structure using a pre-trained language model (BERT).

The incorporation of multi-level graph learning and attention-based graph learning to capture both the local and global relationships between nodes in the text graph.

The dynamic fusion of the outputs from the previous stages with the output from a pre-trained language model to obtain a final integrated vector representation of the text, allowing for better performance on the classification task.

The manuscript is organized as follows: Section 2 presents a literature review on the related works in the field of text classification, graph neural networks, and pre-trained language models. Section 3 presents the proposed hierarchical graph-based text classification framework with contextual node embedding and BERT-based dynamic fusion, describing each individual stage in detail. Section 4 presents the experimental setup, including the datasets, evaluation metrics, and experimental results. Finally, Section 5 provides a conclusion and highlights the contributions of the study.
semantic associations, leading to more accurate text classification results. Furthermore, our model employs BERT-based dynamic fusion, which dynamically combines the contextual node embeddings with BERT representations. This fusion process allows the model to leverage the power of both the hierarchical graph structure and the rich contextual information provided by BERT. By integrating these two sources of information, our model can effectively capture the complex relationships between nodes and generate predictions that are more accurate.

3. Proposed text classification framework

The proposed Dynamic Graph-based Text Analysis Framework using Attention Mechanisms is a powerful technique for text classification tasks that involve complex relationships between nodes in a hierarchical graph.

The general structure of the proposed scheme has been outlined in Fig. 1. The framework consists of several stages, including Linguistic Feature Extraction, Hierarchical Node Construction with Domain-specific Knowledge, Contextual Node Embedding, Multi-level Graph Learning, Attention-based Graph Learning, and Dynamic Fusion with BERT. The framework leverages pre-trained language models such as BERT to encode contextual information, linguistic features, and domain-specific knowledge into vector representations for each node in the graph. The attention-based graph learning stage allows the model to capture the complex relationships between nodes, while the multi-level graph learning stage enables the model to learn representations at different levels of abstraction. The dynamic fusion stage combines the outputs from the previous stages with the output from an external model using a learning framework, which improves the overall performance of the model.
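To make the flow between these stages concrete, the following is a minimal, hypothetical sketch of how the seven stages could be chained. The stage functions are placeholders standing in for the components described in Sections 3.1–3.7; they are not the actual implementation and only illustrate the data flow.

# Hypothetical orchestration of the seven stages (placeholders only).
def extract_linguistic_features(text):            # Stage 1: POS, dependencies, entities
    return {"pos": [], "deps": [], "entities": []}

def build_hierarchical_graph(text, features):     # Stage 2: documents, sentences, domain knowledge
    return {"nodes": [text], "edges": []}

def embed_nodes(graph, features):                 # Stage 3: contextual node embedding
    return [[0.0] * 8 for _ in graph["nodes"]]

def multi_level_graph_learning(graph, node_vecs): # Stage 4: GNN over the hierarchical graph
    return node_vecs

def sequential_feature_interaction(node_vecs):    # Stage 5: DTW-based sequential features
    return node_vecs

def graph_readout(node_vecs):                     # Stage 6: placeholder mean readout standing in
    return [sum(col) / len(node_vecs) for col in zip(*node_vecs)]  # for the attention-based readout

def dynamic_fusion_with_bert(doc_vec, text):      # Stage 7: would concatenate with a BERT encoding
    return doc_vec                                # and pass through a classifier

def classify(text):
    feats = extract_linguistic_features(text)
    graph = build_hierarchical_graph(text, feats)
    h = embed_nodes(graph, feats)
    h = multi_level_graph_learning(graph, h)
    h = sequential_feature_interaction(h)
    return dynamic_fusion_with_bert(graph_readout(h), text)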
The individual components of our proposed approach in the text classification framework are carefully justified to address specific challenges in the field. The utilization of hierarchical nodes aims to capture the hierarchical relationships and semantic structure inherent in text data, allowing for a more comprehensive understanding of the textual content. By constructing these hierarchical nodes, we can effectively represent the nested relationships between concepts, enabling a more fine-grained analysis of the text data. To capture the contextual information and semantic meaning of words within the hierarchical nodes, we employ contextual node embedding. This approach utilizes pre-trained language models such as BERT or GPT to encode the contextual information of words, considering their context-dependent nature. By leveraging these contextual embeddings, we can better grasp the nuanced meanings and semantic relationships between words, enhancing the model's ability to comprehend the text data. Furthermore, our framework incorporates dynamic fusion with BERT to capitalize on the powerful contextual representations offered by this pre-trained language model. By dynamically fusing the contextual node embeddings with BERT representations, our model benefits from both the hierarchical structure captured by the nodes and the rich contextual information captured by BERT. This integration leads to a more comprehensive understanding of the text data and improves the model's discriminative power. In addition, attention-based graph learning is employed to capture the relationships and dependencies between different nodes within the hierarchical graph. By incorporating attention mechanisms, the model assigns importance weights to different nodes and learns to focus on the most relevant information during classification. This attention-based graph learning enables the model to capture the intricate dependencies between concepts within the hierarchical structure, enhancing its overall discriminative capability. In summary, each component of our proposed approach has been carefully justified based on its ability to address specific challenges in text classification. The hierarchical node construction captures the hierarchical nature of text data, contextual node embedding captures contextual information, dynamic fusion with BERT leverages pre-trained language models, and attention-based graph learning captures relationships between nodes. By synergistically integrating these components, our approach improves the accuracy, interpretability, and overall performance of the text classification framework.

3.1. Linguistic feature extraction

Linguistic Feature Extraction is a stage in the text classification process that involves extracting various linguistic features from the preprocessed text data. These features can help to capture important information about the structure and meaning of the text, which can be used to improve the accuracy of text classification models. Here are the details on this stage (a brief extraction sketch follows this list):

Part-of-Speech (POS) Tagging: Part-of-speech tagging is the process of assigning a part-of-speech tag to each word in the text. This involves analyzing the syntactic context of each word to determine its grammatical category, such as noun, verb, adjective, etc. This can be formalized as follows: For each document t_i, perform POS tagging to obtain a sequence of part-of-speech tags p_i = {p_1, p_2, ..., p_{n_i}}, where p_j represents the part-of-speech tag of the j-th word in the document. The POS tagging process can be implemented using various techniques, such as rule-based systems or machine learning models.

Dependency Parsing: Dependency parsing is the process of identifying the relationships between words in the text, such as subject-verb or noun-adjective relationships. This involves constructing a dependency parse tree for each sentence in the document, where each node represents a word in the sentence, and each edge represents a dependency relationship between words. This can be formalized as follows: For each document t_i, perform dependency parsing to obtain a dependency parse tree D_i = (V, E), where V is the set of nodes representing the words in the document, and E is the set of edges representing the dependency relationships between the words. The dependency parsing process can be implemented using various algorithms, such as the Stanford Parser or the spaCy library.

Named Entity Recognition (NER): Named entity recognition is the process of identifying and classifying named entities in the text, such as people, organizations, and locations. This involves detecting the presence of named entities in the text and assigning them to predefined categories. This can be formalized as follows: For each document t_i, perform NER to obtain a set of named entities N_i = {n_1, n_2, ..., n_{m_i}}, where n_j represents the j-th named entity in the document, along with its corresponding type. The NER process can be implemented using various techniques, such as rule-based systems or machine learning models.

Coreference Resolution: Coreference resolution is the process of identifying when two or more words in the text refer to the same entity, such as when "he" refers to a previously mentioned person. This involves grouping together all the words that refer to the same entity and assigning them a single label. This can be formalized as follows: For each document t_i, perform coreference resolution to identify when two or more words in the document refer to the same entity, and group them together into a single entity. The coreference resolution process can be implemented using various algorithms, such as the Stanford CoreNLP library.
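As an illustration of this stage, the snippet below extracts POS tags, dependencies, and named entities with spaCy, one of the libraries mentioned above. The model name en_core_web_sm and the example sentence are assumptions for demonstration only; coreference resolution is not part of the base spaCy pipeline and would be handled by a separate component such as Stanford CoreNLP.

# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The company announced its latest earnings report yesterday.")

pos_tags = [(tok.text, tok.tag_) for tok in doc]                      # POS tagging
dependencies = [(tok.text, tok.dep_, tok.head.text) for tok in doc]   # dependency parsing
entities = [(ent.text, ent.label_) for ent in doc.ents]               # named entity recognition

print(pos_tags)
print(dependencies)
print(entities)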
3.2. Hierarchical node construction with domain-specific knowledge

In this stage, the text data is represented as a hierarchical graph, where each document is a node, and each sentence within the document is a sub-node. In addition to this, a domain-specific knowledge graph (e.g., WordNet) is incorporated to capture the domain-specific relationships and semantic meanings between the words in the text. Here are more details on this stage:

1. Hierarchical Graph Construction: The first step is to construct a hierarchical graph representation of the text data. In this graph, each document is represented as a node, and each sentence within the document is a sub-node. This can be formalized as follows: Let T = {t_1, t_2, ..., t_n} be a set of preprocessed text data, where t_i represents the i-th document in the set. For each document t_i, construct a hierarchical graph G_i = (N_i, E_i), where N_i is the set of nodes representing the document and its sentences, and E_i is the set of edges representing the hierarchical relationships between the nodes. This can be formalized as: N_i = {n_1, n_2, ..., n_{m_i}}, where n_1 represents the document node and n_2, ..., n_{m_i} represent the sentence nodes within the document, and E_i = {(n_1, n_2), (n_1, n_3), ..., (n_1, n_{m_i})}, representing the hierarchical relationships between the document and its sentences.

2. Domain-specific Knowledge Incorporation: The second step is to incorporate domain-specific knowledge graphs (e.g., WordNet or medical ontologies) to capture the domain-specific relationships and semantic meanings between the words in the text. This can be formalized as follows: For each document t_i, incorporate a domain-specific knowledge graph K_i = (V_i, E_i^k), where V_i is the set of nodes representing the words in the text, and E_i^k is the set of edges representing the domain-specific relationships between the words. This can be formalized as: V_i = {v_1, v_2, ..., v_{k_i}}, where v_j represents the j-th word in the document, and (v_k, v_l) ∈ E_i^k if there is a domain-specific relationship between words v_k and v_l. The domain-specific knowledge graphs can be constructed using various techniques, such as ontology-based methods or distributional semantics. By incorporating domain-specific knowledge in the hierarchical node construction stage, the model becomes more tailored to the specific domain, capturing its unique characteristics. This improves the accuracy of the classification process by enabling the model to focus on the most relevant aspects of the text data. Furthermore, the utilization of domain-specific knowledge enhances the interpretability and explainability of the model. The hierarchical nodes derived from domain-specific knowledge provide a more intuitive representation of the underlying concepts and relationships in the domain. This allows users to understand how the model is making predictions and provides insights into the factors influencing the classification decisions. In summary, the utilization of domain-specific knowledge in the hierarchical node construction stage enhances the accuracy, interpretability, and relevance of the model for the specific domain. By incorporating domain-specific terminologies, taxonomies, expert knowledge, and external resources, the model can capture the unique characteristics of the domain and improve the overall performance of the text classification framework.
3. Hierarchical Graph and Knowledge Graph Fusion: The final step is to fuse the hierarchical graph and the domain-specific knowledge graph to obtain an integrated graph representation of the text data. For each document t_i, fuse the hierarchical graph G_i = (N_i, E_i) and the domain-specific knowledge graph K_i = (V_i, E_i^k) to obtain an integrated graph G'_i = (N'_i, E'_i), where N'_i is the set of nodes representing the document and its sentences, as well as the words in the text, and E'_i is the set of edges representing the hierarchical relationships between the nodes and the domain-specific relationships between the words. This can be formalized as: N'_i = N_i ∪ V_i, where V_i represents the nodes in the domain-specific knowledge graph, and E'_i = E_i ∪ E_i^k, where E_i^k represents the edges in the domain-specific knowledge graph. The fusion process can be performed using various techniques, such as graph convolutional networks or attention mechanisms. For example, graph convolutional networks can be used to learn node embeddings that capture both the hierarchical and domain-specific relationships in the graph. Attention mechanisms can be used to weigh the importance of different nodes in the graph based on their relevance to the classification task. A small construction sketch covering these three steps is given below.
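The following is a hedged sketch, using networkx and WordNet (via NLTK), of how the three construction steps above could be realized. The helper functions, the toy sentences, and the use of shared WordNet synsets as the domain-specific relation are illustrative assumptions rather than the exact procedure used in the framework.

# Requires: pip install networkx nltk; then nltk.download("wordnet") once.
import itertools
import networkx as nx
from nltk.corpus import wordnet as wn

def hierarchical_graph(doc_id, sentences):
    g = nx.Graph()
    g.add_node(doc_id, kind="document")
    for k, sent in enumerate(sentences):
        sent_id = f"{doc_id}-s{k}"
        g.add_node(sent_id, kind="sentence", text=sent)
        g.add_edge(doc_id, sent_id)              # hierarchical edge (document - sentence)
    return g

def knowledge_graph(words):
    g = nx.Graph()
    g.add_nodes_from(set(words), kind="word")
    for w1, w2 in itertools.combinations(set(words), 2):
        if set(wn.synsets(w1)) & set(wn.synsets(w2)):   # shared synset as a simple relation
            g.add_edge(w1, w2, relation="wordnet")
    return g

sentences = ["The company announced its latest earnings report yesterday.",
             "Investors reacted positively to the announcement."]
words = [w.lower().strip(".") for s in sentences for w in s.split()]

G_hier = hierarchical_graph("doc1", sentences)
G_know = knowledge_graph(words)
G_fused = nx.compose(G_hier, G_know)             # integrated graph: N' = N ∪ V, E' = E ∪ E^k
print(G_fused.number_of_nodes(), G_fused.number_of_edges())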
Representing the text data as a hierarchical graph is an efficient approach to text representation, as it captures the hierarchical relationships between the different components of the text (i.e., documents and sentences). This allows the model to better understand the structure of the text data, which can improve its performance on text classification tasks. Incorporating domain-specific knowledge graphs into the text representation enables the model to capture the domain-specific relationships and semantic meanings between words in the text. This is important because the meaning of words can vary depending on the domain or context in which they are used. By incorporating domain-specific knowledge graphs, the model can better understand the meaning of words in the context of the domain being analyzed, which can improve its accuracy on text classification tasks. Fusing the hierarchical graph and the domain-specific knowledge graph into an integrated graph representation enables the model to capture both the hierarchical and domain-specific relationships in the text data. This allows the model to learn more informative node embeddings that capture both the structural and semantic information in the text data. Additionally, the fusion process can be performed using various techniques, such as graph convolutional networks or attention mechanisms, which allows for more flexibility and customization in the model architecture. The Hierarchical Node Construction with Domain-specific Knowledge stage is a novel approach to text representation that leverages the structure and semantic meaning of the text data to improve text classification accuracy. By incorporating hierarchical and domain-specific knowledge into the graph representation, the model is better able to understand the complex relationships between the different components of the text and the meaning of the words in the context of the domain being analyzed. In Fig. 2, the hierarchical, domain-specific knowledge, and integrated graphs generated for a sample text collection consisting of three documents, each with three sentences, have been illustrated.

3.3. Contextual node embedding

Contextual node embedding is the process of representing each node in the hierarchical graph as a vector that encodes its local context information, linguistic features, and domain-specific knowledge. The goal of contextual node embedding is to capture the complex relationships between the nodes in the graph, as well as the broader context in which each node appears. This is achieved using a pre-trained language model, i.e., BERT, which is trained on large amounts of text data to learn how to encode the contextual information of the text. The language model generates a vector representation, also known as an embedding, for each word in the text that captures its meaning and context. The same pre-trained language model can be used to generate embeddings for the nodes in the hierarchical graph. To generate the embeddings for each node in the hierarchical graph, the text associated with each node is first concatenated with the text of its neighboring nodes. This concatenated text is then fed into the language model, which generates a vector representation for each node based on its context in the concatenated text. In addition to capturing the local context information of each node, the language model can also incorporate linguistic features, such as part-of-speech tags, dependency parsing, and named entities, by inputting the text along with these features. This allows the language model to learn how to encode the linguistic properties of the text, which can be useful for downstream tasks such as text classification. Finally, the language model can also incorporate domain-specific knowledge by using knowledge graphs such as WordNet or medical ontologies to provide additional context for each node. By incorporating domain-specific knowledge into the language model, it can learn to encode the semantic relationships between words and concepts in the domain, which can be useful for tasks such as entity recognition and relation extraction. Contextual node embedding is thus a powerful technique for representing the nodes in a hierarchical graph as vectors that capture their local context, linguistic features, and domain-specific knowledge. By using a pre-trained language model, we can take advantage of the large amounts of text data that are available and learn to encode complex relationships between nodes in the graph.

Let G = (V, E) be the hierarchical graph with node set V and edge set E. Each node v ∈ V represents a piece of text, such as a document or a sentence, and has associated text features x_v that describe its local context, linguistic properties, and domain-specific knowledge. Let f(x_v; θ) be a pre-trained language model with parameters θ that takes as input the text features x_v associated with each node v and generates a contextualized vector representation h_v ∈ R^d for each node. The contextualized representation h_v encodes the local context, linguistic properties, and domain-specific knowledge of the node, and is generated by applying a non-linear function g to the output of the language model:

h_v = g(f(x_v; θ))    (1)

where g is a non-linear function, such as a rectified linear unit (ReLU) or a sigmoid function. To generate the contextualized vector representation h_v for each node v, the text features x_v associated with the node are first concatenated with the text features x_u associated with each neighboring node u ∈ N(v), where N(v) is the set of neighboring nodes of v in the graph. The concatenated text features are then input to the language model f, which generates the contextualized vector representation h_v for the node. Formally, the input to the language model for node v is defined as:

x_v = [x_v; Σ_{u∈N(v)} x_u]    (2)

where [;] denotes concatenation and Σ denotes summation. This input concatenates the text features of node v with the sum of the text features of its neighboring nodes, capturing the local context of the node in the graph.

The output of the language model f is a sequence of d-dimensional vectors {h_1, h_2, ..., h_n}, where n is the length of the input sequence. The contextualized vector representation h_v for node v is defined as the corresponding vector in the output sequence:

h_v = h_i    (3)

where i is the index of the token in the input sequence that corresponds to node v. By using a pre-trained language model to generate the contextualized vector representation for each node in the hierarchical graph, we can capture the complex relationships between the nodes, as well as the broader context in which each node appears. This representation can be used as input to downstream tasks such as text classification, entity recognition, and relation extraction, improving the performance of these tasks.
Fig. 2. The sample hierarchical, domain-specific and integrated graphs generated in Stage 3.2.
Suppose we have a hierarchical graph that represents a collection of news articles. The graph consists of document nodes, which represent the articles, and sentence nodes, which represent the sentences within the articles. Each node has associated text features that describe its local context, such as the words in the sentence and their part-of-speech tags, as well as domain-specific knowledge, such as the named entities mentioned in the text. To generate the contextualized vector representation for each node in the graph, we use a pre-trained language model such as BERT. We first concatenate the text features of each node with the text features of its neighboring nodes, capturing the local context of the node in the graph. We then input this concatenated text to the language model, which generates a contextualized vector representation for each node. For example, consider the sentence node "The company announced its latest earnings report yesterday." This sentence node has associated text features that describe the words in the sentence ("the", "company", "announced", "its", "latest", "earnings", "report", "yesterday") and their part-of-speech tags ("DT", "NN", "VBD", "PRP$", "JJS", "NNS", "NN", "NN"). It also has associated domain-specific knowledge, such as the named entity "company". To generate the contextualized vector representation for this sentence node, we concatenate its text features with the text features of its neighboring sentence nodes and input the resulting sequence to the language model. The language model generates a d-dimensional vector representation that captures the meaning and context of the sentence node. This process is repeated for all nodes in the hierarchical graph, generating a contextualized vector representation for each node. These representations can be used as input to downstream tasks such as text classification, entity recognition, and relation extraction, improving the performance of these tasks by capturing the complex relationships between the nodes in the graph.
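A minimal sketch of this embedding step is shown below using the Hugging Face transformers library with bert-base-uncased; the specific checkpoint, the 128-token limit, and the use of the [CLS] vector as the node representation are illustrative assumptions, since the text above only states that a pre-trained BERT model is used.

# Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def node_embedding(node_text, neighbor_texts):
    # Concatenate the node text with its neighbors' text to capture local context
    # (mirroring Eq. (2)), then take the [CLS] vector as the node representation h_v.
    joined = " ".join([node_text] + list(neighbor_texts))
    inputs = tokenizer(joined, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0)   # d-dimensional vector (d = 768)

h_v = node_embedding(
    "The company announced its latest earnings report yesterday.",
    ["Investors reacted positively to the announcement."],
)
print(h_v.shape)   # torch.Size([768])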
3.4. Multi-level graph learning

The Multi-level Graph Learning stage is a critical component of the text classification framework that enables the model to learn the representations of the nodes in the hierarchical graph. This stage employs a combination of Graph Neural Networks (GNNs) and attention mechanisms to capture the contextual information from neighboring nodes, linguistic features, and domain-specific knowledge, and to enhance the representation learning. The graph is learned at multiple levels of granularity, which allows the model to capture the different levels of information in the text data. For instance, at the word level, the model can learn the relationships between individual words and their neighbors within the context of the sentence. At the sentence level, the model can learn the relationships between sentences within a document, and at the document level, the model can learn the relationships between different documents in the corpus. The multi-level graph learning scheme consists of the following key steps (a simplified aggregation sketch follows this list):

Node Initialization: Each node in the hierarchical graph is initialized with its contextualized representation obtained from the Contextual Node Embedding stage.

Neighborhood Aggregation: For each node in the graph, its neighboring nodes are identified based on a pre-defined threshold distance in the graph. The contextualized representations of the neighboring nodes are then aggregated using a Graph Neural Network (GNN) to capture the contextual information from neighboring nodes. The goal of this step is to aggregate the representations of the neighboring nodes for each node i in the graph. Let G = (V, E) be the hierarchical graph representing the text data, where V is the set of nodes in the graph (including documents, sentences, and words), and E is the set of edges between the nodes. Let h_i^(0) be the initial representation of node i in the graph, obtained from the Contextual Node Embedding stage. Let N_i be the set of neighboring nodes of node i, defined as:

N_i = {j | (i, j) ∈ E and d(i, j) ≤ r}    (4)

where d(i, j) is the distance between nodes i and j in the graph, and r is a pre-defined threshold distance. The aggregated representation for node i is then obtained by applying a Graph Neural Network (GNN) to the set of neighboring nodes, as follows:

h_i^(l+1) = f(h_i^(l), aggregate_{j∈N_i} g(h_j^(l)))    (5)

where l is the current level of the graph learning, f is a non-linear activation function, g is a transformation function that maps the input representation to a new representation, and aggregate is a permutation-invariant function that aggregates the representations of the neighboring nodes. One common choice for aggregate is the max pooling function.

Linguistic Feature Integration: In addition to the contextualized representations, the linguistic features (such as part-of-speech tags, dependency parsing, and named entities) and domain-specific knowledge are also integrated into the node representations using attention mechanisms. Let X be the matrix of linguistic features and domain-specific knowledge for all nodes in the graph. The attention mechanism is defined as:

a_i = softmax(W_a [h_i^(l); x_i])    (6)

where [h_i^(l); x_i] is the concatenation of the current representation and the corresponding row of X, and W_a is a learnable weight matrix. The final representation for node i is then obtained by combining the current representation with the attention-weighted representations of its neighbors, as follows:

h_i^(l+1) = [h_i^(l+1); Σ_{j∈N_i} a_{ij} h_j^(l+1)]    (7)

where a_{ij} is the attention weight assigned to node j by node i.

Multi-level Graph Learning: The graph is learned at multiple levels of granularity (word, sentence, and document levels) using a combination of GNNs and attention mechanisms. This allows the model to capture the different levels of information in the text data and to learn representations that are optimized for each level of granularity.

Output Prediction: The final prediction is obtained by applying a classification layer to the learned representations of the nodes in the hierarchical graph, as follows:

y = softmax(W_y h_i^(l))    (8)

where l is the final level of the graph learning, W_y is a learnable weight matrix, and softmax is the function that normalizes the output into a probability distribution over the class labels.

The Multi-level Graph Learning stage enables the model to learn representations of the nodes in the hierarchical graph that capture the complex relationships between the nodes in the graph, and to use this information to improve the accuracy of text classification tasks. By incorporating both the contextualized representations and the linguistic features and domain-specific knowledge, the model can capture a wide range of information in the text data, and by learning the graph at multiple levels of granularity, the model can capture the different levels of information in the text data.
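The sketch below is a simplified PyTorch rendering of the neighborhood aggregation in Eq. (5), using max pooling as the permutation-invariant aggregate and a linear map followed by ReLU as the transformation; the dimensions, the explicit neighbor lists, and the single-layer form are illustrative assumptions rather than the full multi-level scheme.

import torch
import torch.nn as nn

class NeighborhoodAggregation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)     # g(.)
        self.update = nn.Linear(2 * dim, dim)    # f(.) applied to [h_i ; aggregate]

    def forward(self, h, neighbors):
        # h: (num_nodes, dim) node representations at level l
        # neighbors: list of index lists, neighbors[i] = N_i
        out = torch.zeros_like(h)
        for i, n_i in enumerate(neighbors):
            if n_i:
                agg = torch.max(self.transform(h[n_i]), dim=0).values   # max pooling aggregate
            else:
                agg = torch.zeros_like(h[i])
            out[i] = torch.relu(self.update(torch.cat([h[i], agg])))
        return out

h = torch.randn(5, 16)                        # 5 nodes with 16-dimensional embeddings
neighbors = [[1, 2], [0], [0, 3], [2, 4], [3]]
layer = NeighborhoodAggregation(16)
print(layer(h, neighbors).shape)              # torch.Size([5, 16])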
3.5. Dynamic text sequential feature interaction

The Dynamic Text Sequential Feature Interaction stage is used to capture the sequential information of the entire document, which is important for many text classification tasks such as sentiment analysis or predicting the next word in a sentence. This stage uses a Dynamic Time Warping (DTW) algorithm to align the sequence of word embeddings, which allows for the identification of important temporal relationships between the words in the text. The DTW algorithm is used to align the sequence of word embeddings to identify the optimal path through the sequence of embeddings. The optimal path represents the alignment that minimizes the distance between the two sequences. In the context of text classification, the two sequences are the sequence of word embeddings for a given document and a reference sequence of embeddings (e.g., the average embeddings for a particular category). The general structure of the Dynamic Time Warping (DTW) algorithm used in the Dynamic Text Sequential Feature Interaction stage has been presented in Algorithm 1.

Algorithm 1. The general structure of the Dynamic Time Warping (DTW) algorithm.

Inputs: sequence X of length n, sequence Y of length m
Output: DTW distance between X and Y
1. Create a distance matrix D of size (n + 1) × (m + 1) and initialize all values to infinity.
2. Set D[0, 0] = 0.
3. For i in 1 to n, and j in 1 to m:
   a. Compute the distance between X[i] and Y[j]: dist = distance(X[i], Y[j])
   b. Update the distance matrix: D[i, j] = dist + min(D[i−1, j], D[i, j−1], D[i−1, j−1])
4. Return D[n, m]
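A direct Python rendering of Algorithm 1 is given below; the Euclidean distance function and the toy two-dimensional embedding sequences are assumptions chosen from the options mentioned in the text.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(X, Y, distance=euclidean):
    n, m = len(X), len(Y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]   # (n+1) x (m+1) distance matrix
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = distance(X[i - 1], Y[j - 1])
            D[i][j] = dist + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Toy word-embedding sequences (document vs. category reference).
doc_embeddings = [[0.1, 0.2], [0.4, 0.1], [0.3, 0.8]]
ref_embeddings = [[0.1, 0.3], [0.35, 0.75]]
print(dtw_distance(doc_embeddings, ref_embeddings))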
In Algorithm 1, the distance function distance(X[i], Y[j]) computes the distance between two word embeddings X[i] and Y[j]. The distance function can be any function that computes the distance between two vectors, such as Euclidean distance or cosine similarity. The algorithm computes a distance matrix D, where D[i, j] represents the minimum distance between the first i elements of sequence X and the first j elements of sequence Y. The algorithm starts by initializing the distance between the first elements of both sequences to 0. Then, for each subsequent element in X and Y, it computes the distance between the two elements and updates the distance matrix D with the minimum distance among the three possible paths: going down from the previous row, going right from the previous column, or going diagonally from the previous diagonal. Once the distance matrix D is computed, the DTW distance between the two sequences is simply the value in the bottom-right corner of the matrix, which represents the minimum distance between the entire sequences. The DTW algorithm is dynamic because it allows the sequences to be aligned in a non-linear fashion, accommodating variations in the timing and duration of the events in the sequences. This makes it a useful tool for capturing the sequential information of the entire document, which is important for many text classification tasks.

Once the alignment is obtained, a feature interaction mechanism is used to capture the important temporal relationships between the words in the text. This is achieved by multiplying the aligned word embeddings together to create a new set of features that captures the interactions between neighboring words. The resulting feature vector can be used as input to a machine learning algorithm for text classification.

3.6. Attention-based graph learning

The Attention-based Graph Learning stage is used to calculate attention weights for each node in the hierarchical graph and compute a weighted sum of the node embeddings to yield an integrated vector representation encoding the text. An attention mechanism is employed to better capture the important information from the nodes, allowing the model to focus on the most relevant nodes in the graph. The attention weights are calculated based on the similarity between the node embeddings and a query vector, which is typically learned during the training process. The similarity can be computed using any function that computes the similarity between two vectors, such as the dot product or cosine similarity. The attention weights are then used to compute a weighted sum of the node embeddings, where the weights serve as the weights for the sum. This results in a new representation of the text that captures the important information from the nodes, with higher weights given to nodes that are more relevant to the task at hand. The attention mechanism allows the model to focus on the most relevant nodes in the graph, which can vary depending on the specific text classification task. For example, in a sentiment analysis task, the model may need to focus more on the words that convey sentiment, while in a topic classification task, the model may need to focus more on the words that relate to the topic of the document.

Let G = (V, E) be a hierarchical graph representing a document, where V is the set of nodes and E is the set of edges. Each node i in V represents a word in the document and is embedded with a contextualized representation h_i, which encodes its local context information, linguistic features, and domain-specific knowledge. We first compute a query vector q that captures the task-specific information:

q = f_task(T)    (9)

where f_task is a function that computes the task-specific query vector from a task representation T. The task-specific function is a neural network classifier that takes as input the integrated vector representation of the entire document and outputs the predicted class label. For example, suppose we have two classes (positive and negative) and a neural network classifier with weights W_c and bias b_c. Then the task-specific function can be defined as:

f_task(T) = σ(W_c T + b_c)    (10)

where σ is the sigmoid activation function. The output of this function is a scalar value between 0 and 1, representing the probability of the input document belonging to the positive class. Next, we compute the attention weights for each node i in V based on its embedding h_i and the query vector q:

e_i = g(h_i, q),    a_i = softmax(e_i)    (11)

where g is a function that computes the attention score between the node embedding h_i and the query vector q, and softmax is the function that normalizes the attention scores to obtain a probability distribution over the nodes. We then compute a weighted sum of the node embeddings to obtain a new representation of the text:

h_G = Σ_i a_i h_i    (12)

where h_G is the integrated vector representation encoding the text. The Attention-based Graph Learning stage thus uses an attention mechanism to compute attention weights for each node in the graph based on its embedding and a task-specific query vector. The attention weights are used to compute a weighted sum of the node embeddings, yielding an integrated vector representation that captures the important information from the nodes and allows the model to focus on the most relevant nodes in the graph.
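A compact PyTorch sketch of Eqs. (11)-(12) is given below, using dot-product scoring (one of the similarity functions mentioned above); the embedding dimension and the randomly initialized query vector are illustrative assumptions, since in the full framework the query is learned during training.

import torch

def attention_readout(H, q):
    # H: (num_nodes, d) node embeddings; q: (d,) task-specific query vector
    scores = H @ q                           # e_i = g(h_i, q) with dot-product scoring
    weights = torch.softmax(scores, dim=0)   # a_i = softmax(e_i)
    return weights @ H                       # h_G = sum_i a_i * h_i

H = torch.randn(6, 32)    # six word nodes with 32-dimensional embeddings
q = torch.randn(32)       # stands in for the learned query vector
h_G = attention_readout(H, q)
print(h_G.shape)          # torch.Size([32])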
3.7. Dynamic fusion with pre-trained language models

In this stage, we apply the previous stages of the framework (i.e., Linguistic Feature Extraction, Hierarchical Node Construction with Domain-specific Knowledge, Contextual Node Embedding, Multi-level Graph Learning, Dynamic Text Sequential Feature Interaction, and Attention-based Graph Learning) to obtain the integrated vector representation of the text. Next, we feed the integrated vector representation into BERT, which generates a contextualized representation of the text. Finally, we concatenate the contextualized representation from BERT with the output from the previous stages, and pass the concatenated vector through a fully-connected layer to obtain the final class label. This approach leverages the strengths of both the proposed framework and BERT, allowing for better performance on the classification task.

Let C be the set of classes, and let X be the input text. Let f_C be a function that maps X to a set of class probabilities, i.e.,

f_C(X) = [p(c|X)]_{c∈C}    (13)

where p(c|X) is the probability of class c given input text X. The output of the Dynamic Fusion with BERT stage is obtained by dynamically combining the outputs from the previous stages with the outputs from a pre-trained BERT model. Let h_BERT be the output of the pre-trained BERT model for input X, and let h_i be the output of the i-th previous stage. The output of the Dynamic Fusion with BERT stage is defined as:

g(h_BERT, h_1, h_2, ..., h_n) = f_C(concatenate(h_BERT, h_1, h_2, ..., h_n))    (14)

where concatenate(h_BERT, h_1, h_2, ..., h_n) denotes the concatenation of the output vectors h_BERT, h_1, h_2, ..., h_n from each previous stage, and f_C is a function that maps the concatenated output vector to a set of class probabilities.
The parameters of the model are optimized by minimizing a loss function that measures the difference between the predicted class probabilities and the true class labels. The loss function can be defined as:

L = −log(p(y|X))    (15)

where y is the true class label, and p(y|X) is the predicted probability of the true class label given input text X.
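The following is a hedged sketch of the fusion and training objective described in Eqs. (13)-(15): stage outputs are concatenated with a BERT encoding of the document and passed through a fully-connected layer, trained with the negative log-likelihood (cross-entropy). The dimensions, batch size, and two-class setup are illustrative assumptions.

import torch
import torch.nn as nn

class DynamicFusionClassifier(nn.Module):
    def __init__(self, bert_dim, stage_dims, num_classes):
        super().__init__()
        self.fc = nn.Linear(bert_dim + sum(stage_dims), num_classes)

    def forward(self, h_bert, stage_outputs):
        fused = torch.cat([h_bert, *stage_outputs], dim=-1)   # concatenate(h_BERT, h_1, ..., h_n)
        return self.fc(fused)                                  # logits; softmax gives p(c|X)

model = DynamicFusionClassifier(bert_dim=768, stage_dims=[32, 32], num_classes=2)
h_bert = torch.randn(4, 768)                  # BERT [CLS] vectors for a batch of 4 documents
stages = [torch.randn(4, 32), torch.randn(4, 32)]
logits = model(h_bert, stages)

loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))   # L = -log p(y|X)
print(logits.shape, float(loss))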
4. Experimental results and discussion

In this section, we present the experimental results and discussion of our proposed Hierarchical Graph-based Text Classification Framework with Contextual Node Embedding and BERT-based Dynamic Fusion, as well as the baseline models. The experiments were conducted on several benchmark datasets to evaluate the effectiveness and generalizability of our proposed framework. We report the classification accuracy, precision, recall, and F1-score to measure the performance of each model. Furthermore, we conduct an ablation study to investigate the contribution of each module in our proposed framework.

4.1. Datasets

The experimental evaluation in this study utilizes several benchmark datasets that are commonly used to assess the performance of state-of-the-art models in text classification. The rest of this section presents a brief explanation of each dataset and its descriptive statistics:

20NG: The dataset called "20 Newsgroups" is composed of around 20,000 documents related to different newsgroups, distributed nearly equally across 20 categories. On average, each document is 96.5 words long and the vocabulary size in the dataset is 42,757 words (Wang et al., 2023).

Airline Twitter dataset: This dataset contains approximately 14,000 tweets related to major US airlines. Each tweet is labeled as positive, negative, or neutral. The average tweet length is 14.4 words, and the vocabulary size is 11,168 words (Wan and Gao, 2015; Onan, 2022).

App dataset: The App dataset consists of approximately 752,937 reviews of mobile apps from the Apple App Store. Each review is labeled as positive or negative. The average review length is 27.8 words, and the vocabulary size is 20,238 words (He and McAuley, 2016).

MR: The MR dataset consists of movie reviews labeled as positive or negative. It contains 10,662 reviews, with an average length of 19.8 words and a vocabulary size of 18,764 words (Pang and Lee, 2005).

Ohsumed: The Ohsumed dataset is a collection of abstracts from medical research papers. It contains 7,400 documents, partitioned into 23 different medical categories. The average document length is 135.82 words, and the vocabulary size is 14,157 words (Hersh et al., 1994).

R52: The R52 dataset is a collection of news articles from the Reuters newswire service. It contains 9,100 documents, partitioned into 52 different topics. The average document length is 245.6 words, and the vocabulary size is 49,230 words (Wang et al., 2023).

R8: The R8 dataset is a subset of the R52 dataset, containing 7,674 documents partitioned into 8 different topics. The average document length is 244.9 words, and the vocabulary size is 11,692 words (Wang et al., 2023).

Sarcasm dataset: This dataset consists of approximately 40,000 tweets labeled as sarcastic or non-sarcastic. The average tweet length is 17.5 words, and the vocabulary size is 12,090 words (Onan, 2019).

In Table 1, a summary of the datasets used in the experimental analysis is presented.

4.2. Baselines

BERT: BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google that utilizes a transformer-based architecture to generate contextualized word embeddings. BERT has achieved state-of-the-art performance in various natural language processing tasks, including text classification, question answering, and named entity recognition (Devlin et al., 2018).

BERT-GAT: It is a graph-based approach for text classification that combines the power of pre-trained language models like BERT and the Graph Attention Network (GAT) to capture complex relationships between words and concepts in a text. The approach constructs a graph from the text, with words as nodes and their relationships as edges, and applies the GAT to learn the representations of each node. Then, the representations are fed into a BERT model to generate the final classification result (Lin et al., 2021).

BiLSTM: It stands for Bidirectional Long Short-Term Memory. It is a type of Recurrent Neural Network (RNN) that processes sequential data in both forward and backward directions. BiLSTMs are commonly used in Natural Language Processing (NLP) tasks, such as text classification and sentiment analysis, to capture the sequential nature of language (Liu et al., 2016).

CGA2TC: It is a graph-based approach for text classification that incorporates contrastive learning with an adaptive augmentation strategy to obtain more robust node representations. It constructs a text graph by exploring word co-occurrence and document-word relationships, and introduces an adaptive augmentation strategy to generate two contrastive views that highlight important edges while reducing noise (Yang et al., 2022).

CNN-non-static: It utilizes a convolutional layer to extract local features from the input text and a max-pooling layer to obtain the most important features. The network is trained end-to-end using backpropagation with stochastic gradient descent (Kim, 2014).

FastText: It is a word embedding-based text classification method that utilizes n-gram features and provides an efficient way of representing the semantics of words and phrases in text data (Joulin et al., 2016).

HyperGAT: It is a graph-based approach for text classification that uses hypergraphs to model the relationships between words in a document. It employs dual attention mechanisms to capture the high-order interactions between words and improve the efficiency of feature extraction (Ding et al., 2020).

SWEM: It is a text classification approach that utilizes pre-trained word embeddings to encode text. SWEM uses different pooling mechanisms, such as max-pooling and average-pooling, to obtain a fixed-length document representation (Shen et al., 2018).

TensorGCN: It is a graph-based approach for text classification that uses a graph tensor to capture the relationships between words in a document. In this method, each word is treated as a node in the graph tensor, and the edges between nodes represent the co-occurrence of words in the same document (Yao et al., 2019).
Table 1
The descriptive information for the benchmark datasets.
Dataset Name Number of Documents Number of Categories Average Document Length Vocabulary Size
20NG 20,000 20 96.5 words 42,757 words
Airline Twitter dataset 14,000 3 (Positive, Negative, Neutral) 14.4 words 11,168 words
App dataset 752,937 2 (Positive, Negative) 27.8 words 20,238 words
MR 10,662 2 (Positive, Negative) 19.8 words 18,764 words
Ohsumed 7,400 23 135.82 words 14,157 words
R52 9,100 52 245.6 words 49,230 words
R8 7,674 8 244.9 words 11,692 words
Sarcasm dataset 40,000 2 (Sarcastic, Non-sarcastic) 17.5 words 12,090 words
words in a document. In this method, each word is treated as a node in the graph tensor, and the edges between nodes represent the co-occurrence of words in the same document (Yao et al., 2019).
TextFCG: The architecture addresses the limitations of traditional transductive learning approaches for text classification by constructing a single graph for all the words in each text and labeling the edges with various contextual relations. The text graph captures different kinds of document information and enhances the connectivity of the graph by introducing more typed edges, which improves the learning effect of the graph neural network (Wang et al., 2023).
TextFCG-BERT: The architecture combines the TextFCG method, which constructs a graph for text classification based on fused contextual information, with BERT (Wang et al., 2023).
TextGCN: It uses a pre-trained word embedding model to initialize the word vectors in the graph and then applies GCN layers to capture the graph structure and aggregate information from neighboring nodes (Yao et al., 2019).
TextING: It is a graph-based approach for inductive text classification. Unlike traditional approaches that use pre-defined or fixed graphs, TextING constructs graphs on the fly, specific to each document. It employs a sliding window approach to build individual graphs, where each word in a document is represented as a node and relationships between words are used to extract features (Zhang et al., 2020). A sliding-window construction of this kind is sketched after this list.
TextING-M: It is a transductive variant of the TextING model. TextING-M constructs a separate graph for each document and extracts its edges from the large graph built over the whole corpus (Zhang et al., 2020).
Text-level GNN: It is a graph-based approach for text classification that builds document-level graphs with global parameter sharing to learn text representations. The model employs graph neural networks (GNNs) to capture the graph structure and aggregate information from neighboring nodes (Huang et al., 2019).
TextSSL: It is a graph-based sparse structure learning model for inductive document classification that addresses challenges such as word ambiguity, word synonymity, and dynamic contextual dependency (Piao et al., 2022).
T-VGAE (Topic-Enhanced Variational Graph Auto-Encoder): It is a graph-based text classification model that leverages topic modeling and variational autoencoders (VAEs) to capture the underlying structure of text documents (Xie et al., 2021).
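The following is a minimal sketch of the sliding-window, per-document word co-occurrence construction described for TextING and TextGCN above; the window size, tokenization, and function names are illustrative assumptions rather than the exact settings of the cited papers.

```python
# Sketch: sliding-window co-occurrence graph for a single document, in the spirit
# of the per-document graphs used by TextING and the word-word edges in TextGCN.
from collections import defaultdict

def build_document_graph(tokens, window_size=3):
    """Return (nodes, edge weights), where edge weights count co-occurrences
    of word pairs inside a sliding window over the token sequence."""
    nodes = sorted(set(tokens))
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window_size, len(tokens))):
            neighbor = tokens[j]
            if neighbor != word:
                key = tuple(sorted((word, neighbor)))   # undirected edge
                edges[key] += 1
    return nodes, dict(edges)

tokens = "the movie was slow but the ending was great".split()
nodes, edges = build_document_graph(tokens, window_size=3)
print(len(nodes), "nodes,", len(edges), "edges")
```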
4.3. Model variations

To evaluate the proposed framework's effectiveness, the study examined nine different model variants. A brief overview of the model variants utilized in the empirical analysis is presented below; a configuration sketch showing how these ablations can be toggled follows the list:
Model Variant-1: It is a variation of the proposed framework that only uses Linguistic Feature Extraction (LFE) based on Part-of-Speech (POS) tagging. This means that the model only considers the grammatical structure of the text and extracts features such as noun, verb, and adjective tags.
Model Variant-2: It employs Linguistic Feature Extraction (LFE) based on dependency parsing only. This means that instead of using Part-of-Speech (POS) tagging to extract linguistic features, the model uses dependency parsing to identify the grammatical relationships between words in a sentence.
Model Variant-3: This variant focuses on extracting entities such as people, organizations, and locations from the input text as features to enhance the classification performance. By using only NER features, this variant aims to investigate the effectiveness of named entity recognition in capturing the discriminative information for text classification tasks.
Model Variant-4: This variant utilizes LFE based on coreference resolution only, where the model extracts coreferent mentions in the text and assigns them to their respective entities. This variant does not consider any other linguistic features such as part-of-speech tagging or dependency parsing.
Model Variant-5: It utilizes the hierarchical node construction stage without incorporating domain-specific knowledge.
Model Variant-6: This variant removes the contextual node embedding stage from the proposed framework. Without this stage, the model is unable to capture the local context information and linguistic features of each node in the graph.
Model Variant-7: This variant removes the dynamic text sequential feature interaction stage from the proposed framework. Without this stage, the model is unable to capture the sequential information of the text and generate dynamic features for each node.
Model Variant-8: This variant removes the attention-based graph learning stage from the proposed framework. Without this stage, the model is unable to capture the important features of the nodes in the graph and may not fully utilize the hierarchical structure of the graph.
Model Variant-9: This variant removes the dynamic fusion with BERT stage from the proposed framework.
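A minimal configuration sketch of the ablations above is given below; the flag and class names are illustrative assumptions, not part of a released implementation.

```python
# Sketch: the model variants above expressed as boolean configuration flags.
from dataclasses import dataclass, replace

@dataclass
class FrameworkConfig:
    use_pos: bool = True                     # POS tags in Linguistic Feature Extraction
    use_dependency: bool = True              # dependency parsing features
    use_ner: bool = True                     # named entity features
    use_coreference: bool = True             # coreference resolution features
    use_domain_knowledge: bool = True        # Hierarchical Node Construction knowledge
    use_contextual_embedding: bool = True    # Contextual Node Embedding stage
    use_sequential_interaction: bool = True  # Dynamic Text Sequential Feature Interaction
    use_graph_attention: bool = True         # Attention-Based Graph Learning
    use_bert_fusion: bool = True             # Dynamic Fusion with BERT

full_model = FrameworkConfig()
variant_1 = replace(full_model, use_dependency=False, use_ner=False, use_coreference=False)  # POS only
variant_6 = replace(full_model, use_contextual_embedding=False)
variant_9 = replace(full_model, use_bert_fusion=False)
print(variant_1)
```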
4.4. Experimental settings

For the empirical analysis, we have adopted the empirical framework outlined in Wang et al. (2023). We used the provided training and testing sets for each dataset and randomly selected 10% of the training set as the validation set. By default, we used two layers of the GNN module and tuned the hyperparameters to optimize performance on the validation set. We used the Adam optimizer with a learning rate of 0.001 for Ohsumed, 8e-4 for 20NG, and 3e-4 for the other datasets. Dropout was set to 0.5 for each module, and we randomly dropped edges with a probability of 0.3 for the best performance. For the word embeddings, we initialized them using pre-trained GloVe vectors with dimension 300. We froze the word embeddings during model training for inductive text classification tasks. Out-of-vocabulary words were replaced with UNK and randomly sampled from a uniform distribution of [-0.01, 0.01]. We used the default parameter settings for the baseline models, as described in their original papers or implementations. The L2 loss weight was set to 5e-5 for R8 and 20NG and to 5e-6 for the other datasets. In Table 2, the parameter list for the proposed scheme is outlined; a minimal training-setup sketch reflecting these settings is given after the table.

Table 2
The parameter settings for the proposed framework.
Stage Parameter Default Value
Linguistic Feature Extraction Maximum Sequence Length 128
Hierarchical Node Construction with Domain-specific Knowledge Window Size for Document Splitting 512
Hierarchical Node Construction with Domain-specific Knowledge Window Size for Sentence Splitting 64
Contextual Node Embedding Pre-trained Language Model BERT
Contextual Node Embedding Batch Size 32
Contextual Node Embedding Learning Rate 2e-5
Multi-level Graph Learning Number of Levels 3
Multi-level Graph Learning Threshold Distance for Neighbor Nodes 2
Multi-level Graph Learning GNN Hidden Size 256
Multi-level Graph Learning Attention Hidden Size 128
Dynamic Text Sequential Feature Interaction Reference Sequence Length 10
Dynamic Text Sequential Feature Interaction DTW Window Size 5
Attention-based Graph Learning Attention Hidden Size 128
Attention-based Graph Learning Dropout Rate 0.5
Fully-connected Layer Number of Hidden Layers 1
Fully-connected Layer Hidden Layer Size 256
Fully-connected Layer Activation Function ReLU
Fully-connected Layer Output Size Number of Classes
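The following sketch mirrors the settings above and the defaults in Table 2 (per-dataset learning rates, dropout of 0.5, edge dropping with probability 0.3, frozen 300-dimensional GloVe embeddings, and uniform [-0.01, 0.01] initialization for out-of-vocabulary words). The module and variable names are illustrative, and applying the L2 loss weight as Adam weight decay is an assumption made for this sketch.

```python
# Sketch: a training setup consistent with the reported experimental settings.
import torch
import torch.nn as nn

LEARNING_RATES = {"ohsumed": 1e-3, "20ng": 8e-4}   # 3e-4 is used for the remaining datasets
L2_WEIGHTS = {"r8": 5e-5, "20ng": 5e-5}            # 5e-6 is used for the remaining datasets

def make_embedding(glove_matrix, num_oov, dim=300):
    """Stack pre-trained GloVe rows with uniformly initialized OOV rows and freeze them."""
    oov_rows = torch.empty(num_oov, dim).uniform_(-0.01, 0.01)
    weights = torch.cat([glove_matrix, oov_rows], dim=0)
    return nn.Embedding.from_pretrained(weights, freeze=True)

def drop_edges(edge_index, p=0.3):
    """Randomly drop graph edges; edge_index is a [2, num_edges] tensor."""
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

dataset = "ohsumed"
glove = torch.randn(100, 300)                # placeholder for the real 300-d GloVe matrix
model = nn.Sequential(
    make_embedding(glove, num_oov=5),        # frozen word embeddings
    nn.Dropout(0.5),                         # dropout of 0.5 per module
    nn.Linear(300, 23),                      # 23 classes for Ohsumed (Table 1)
)
edge_index = drop_edges(torch.randint(0, 105, (2, 40)))   # roughly 30% of edges removed
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable,
                             lr=LEARNING_RATES.get(dataset, 3e-4),
                             weight_decay=L2_WEIGHTS.get(dataset, 5e-6))
```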
4.5. Experimental results

This section covers the experimental results and discussion of our proposed text classification framework, which is based on hierarchical graphs, contextual node embedding, and dynamic fusion with BERT. We also present the results of the baseline models and evaluate the effectiveness and generalizability of our framework on various benchmark datasets. To measure the performance of each model, we report accuracy, precision, recall, and F1-score. Additionally, we conduct an ablation study to analyze the contribution of each module in our framework.

In Table 3, the classification accuracy values obtained by the baseline models and the proposed framework are presented. The experimental results indicate that the proposed model achieves the highest accuracy on all datasets. The second highest predictive performance among the compared models is achieved by the Text-FCG + BERT algorithm, which is a graph and transformer-based model, followed by Bi-LSTM and BERT + GAT. The empirical results indicate that graph-based approaches to text classification are effective for capturing the complex relationships between words and producing accurate predictions. On the lower end of the spectrum, we see models such as Text-FCG, fastText, and SWEM, which achieve relatively low accuracy scores compared to the other models. These models typically rely on simpler approaches to text classification, such as word embedding-based representations. In Table 3, the empirical results for the model variants are also presented. The results show that the proposed framework outperforms all model variants and baselines on all datasets, indicating the importance of each module in the framework. In particular, variants that remove the dynamic fusion with BERT, attention-based graph learning, or dynamic text sequential feature interaction perform worse than the proposed framework, highlighting the effectiveness of these modules in improving classification performance. Additionally, model variants that rely on only one type of linguistic feature extraction perform worse than the proposed framework, indicating that combining multiple types of linguistic features is beneficial for text classification. Among the baseline models, BERT + GAT, Bi-LSTM, and HyperGAT perform relatively well, while fastText and SWEM perform the worst. It is worth noting that BERT, which has been widely used in many NLP tasks, achieves high performance on most datasets. This suggests that pre-trained language models like BERT can provide strong feature representations for text classification tasks. In terms of model variants, we can see that LFE based on POS tagging only (Model Variant-1) contributes significantly to the overall performance of the proposed model, indicating that part-of-speech tags can be a useful feature in text classification. On the other hand, removing key components of the proposed framework, such as domain-specific knowledge, contextual node embedding, dynamic text sequential feature interaction, and attention-based graph learning, results in a significant decrease in performance. This suggests that each component of the proposed framework plays an important role in improving the classification performance. The experimental results show that the proposed framework is a promising approach for text classification tasks, and the ablation study sheds light on the contributions of different components to the overall performance.

Table 3
The classification accuracy values obtained by the models.

In Fig. 3, the histogram of accuracy values for the compared models is presented. Based on the results, we can see that the proposed model achieves the highest performance on all benchmark datasets, with a significant margin over the other models. It achieves an accuracy of 93.71% for 20NG, 94.18% for Airline Twitter, 96.97% for App, 89.30% for MR, 74.87% for Ohsumed, 98.75% for R8, 94.11% for R52, and 98.32% for Sarcasm. Among the baseline models, BERT + GAT, Bi-LSTM, and BERT also achieve relatively high performance, with BERT + GAT and Bi-LSTM outperforming BERT on most of the datasets. The proposed model, which uses hierarchical graph-based text classification with contextual node embedding and BERT-based dynamic fusion, also outperforms the variants that restrict or exclude certain components, such as LFE based on POS tagging only, LFE based on dependency parsing only, LFE based on NER only, LFE based on coreference resolution only, and HNC without domain-specific knowledge. It is also interesting to note that some models that use graph-based approaches, such as HyperGAT, TensorGCN, TextGCN, and Text-level GNN, perform worse than some models that do not, such as Text-FCG, SWEM, and fastText. This suggests that while graph-based approaches can be effective, they may not always be the best choice for text classification tasks.
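For clarity, the accuracy values in Table 3 and the precision, recall, and F1 scores in Tables 4-6 can be computed as in the sketch below; the macro averaging scheme and the toy labels are assumptions made for illustration.

```python
# Sketch: computing the evaluation metrics reported in Tables 3-6 with scikit-learn.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 2, 1, 1, 0, 2, 2, 1]       # gold labels (toy example)
y_pred = [0, 2, 1, 0, 0, 2, 1, 1]       # model predictions (toy example)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)   # averaging scheme assumed
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```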
Based on the design approach of the baseline models and the proposed framework, we have grouped all the compared schemes into six approaches: graph and transformer-based, graph-based, inductive learning-based, sequence-based, transformer-based, and word embedding-based approaches. The first approach, graph and transformer-based, includes models that combine graph neural networks and transformer-based architectures to process text data. The second approach, graph-based, includes models that use only graph neural networks. The third approach, inductive learning-based, includes models that use inductive learning techniques to generalize to new, unseen examples of text data. The fourth approach, sequence-based, includes models that use recurrent neural networks or other sequence-based models. The fifth approach, transformer-based, includes models that use transformer-based architectures, such as the popular BERT model. The sixth and final approach, word embedding-based, includes models that use word embedding techniques, such as fastText or GloVe. By categorizing the models into these approaches, the study provides a framework for understanding the different design choices and methodologies used in text classification and can help guide future research in this area. In Fig. 4, the interval plot of accuracy values for the different approaches is presented. As can be observed from the empirical results in Fig. 4, utilizing both graph neural networks and transformer-based architectures to process text data can yield promising results.

In Fig. 5, the interaction plot of accuracy values for the different datasets is presented. Dataset characteristics play a significant role in determining the performance of the models. For instance, the proposed model achieved the highest performance on all datasets, which could be attributed to the complexity of the datasets and the need for sophisticated models to handle them. Among the compared models, graph and transformer-based approaches yield the most competitive results across datasets.
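As an illustration of the plots discussed above, the sketch below draws an interval plot of accuracy by approach in the spirit of Fig. 4; the grouping and the numbers are toy values, not the reported results.

```python
# Sketch: interval plot of mean accuracy per approach with 95% confidence intervals.
import numpy as np
import matplotlib.pyplot as plt

accuracies = {                       # toy accuracy values per approach group
    "graph+transformer": [0.95, 0.93, 0.97],
    "graph-based": [0.88, 0.90, 0.86],
    "transformer-based": [0.91, 0.92, 0.90],
}
labels = list(accuracies)
means = [np.mean(v) for v in accuracies.values()]
errors = [1.96 * np.std(v, ddof=1) / np.sqrt(len(v)) for v in accuracies.values()]

plt.errorbar(range(len(labels)), means, yerr=errors, fmt="o", capsize=4)
plt.xticks(range(len(labels)), labels, rotation=20)
plt.ylabel("Accuracy")
plt.tight_layout()
plt.savefig("interval_plot.png")
```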
Table 4
The precision values obtained by the models.
Table 5
The recall values obtained by the models.
Table 6
The F1-score values obtained by the models.
Table 7
The two-way ANOVA test results by the models.
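A minimal sketch of a two-way ANOVA over accuracy with model and dataset as factors, as summarized in Table 7, is given below; the data layout and the values are a toy example rather than the study's results.

```python
# Sketch: two-way ANOVA with model and dataset as factors over accuracy scores.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

results = pd.DataFrame({
    "model":    ["Proposed", "Proposed", "TextGCN", "TextGCN", "BERT", "BERT"],
    "dataset":  ["R8", "MR", "R8", "MR", "R8", "MR"],
    "accuracy": [0.95, 0.89, 0.91, 0.77, 0.93, 0.86],   # toy values
})
fit = ols("accuracy ~ C(model) + C(dataset)", data=results).fit()
print(sm.stats.anova_lm(fit, typ=2))
```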
Systems Conference (IntelliSys) Volume 2. Springer International Publishing, pp. 432–448.
Huang, L., Ma, D., Li, S., Zhang, X., Wang, H., 2019. Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356.
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T., 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Kim, Y., 2014. Convolutional neural networks for sentence classification. In: Moschitti, A., Pang, B., Daelemans, W. (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A Meeting of SIGDAT, a Special Interest Group of the ACL. ACL, pp. 1746–1751.
Koncel-Kedziorski, R., Bekal, D., Luan, Y., Lapata, M., Hajishirzi, H., 2019. Text generation from knowledge graphs with graph transformers. arXiv preprint arXiv:1904.02342.
Korde, V., Mahender, C.N., 2012. Text classification and classifiers: A survey. Int. J. Artif. Intell. Appl. 3 (2), 85.
Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., et al., 2020. A survey on text classification: From shallow to deep learning. arXiv preprint arXiv:2008.00364.
Li, J., Tang, T., Zhao, W.X., Wen, J.R., 2021. Pretrained language models for text generation: A survey. arXiv preprint arXiv:2105.10311.
Lin, Y., Meng, Y., Sun, X., Han, Q., Kuang, K., Li, J., Wu, F., 2021. BertGCN: Transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727.
Liu, L., Finch, A., Utiyama, M., Sumita, E., 2016. Agreement on target-bidirectional LSTMs for sequence-to-sequence learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1.
Liu, X., You, X., Zhang, X., Wu, J., Lv, P., 2020. Tensor graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8409–8416.
Liu, B., Wu, L., 2022. Graph neural networks in natural language processing. Graph Neural Networks: Found. Front. Appl. 12, 463–481.
Malekzadeh, M., Hajibabaee, P., Heidari, M., Zad, S., Uzuner, O., Jones, J.H., 2021. Review of graph neural network in text classification. In: 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). IEEE, pp. 0084–0091.
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J., 2021. Deep learning-based text classification: A comprehensive review. ACM Comput. Surv. (CSUR) 54 (3), 1–40.
Niu, Z., Zhong, G., Yu, H., 2021. A review on the attention mechanism of deep learning. Neurocomputing 452, 48–62.
Onan, A., 2019. Topic-enriched word embeddings for sarcasm identification. In: Software Engineering Methods in Intelligent Algorithms: Proceedings of 8th Computer Science On-line Conference 2019, vol. 18. Springer International Publishing, pp. 293–304.
Onan, A., 2022. Bidirectional convolutional recurrent neural network architecture with group-wise enhancement mechanism for text sentiment classification. J. King Saud Univ.-Comput. Informat. Sci. 34 (5), 2098–2117.
Otter, D.W., Medina, J.R., Kalita, J.K., 2020. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Networks Learn. Syst. 32 (2), 604–624.
Pang, B., Lee, L., 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv preprint cs/0506075.
Piao, Y., Lee, S., Lee, D., Kim, S., 2022. Sparse structure learning via graph neural networks for inductive document classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, pp. 11165–11173.
Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., Huang, X., 2020. Pre-trained models for natural language processing: A survey. Sci. China Technol. Sci. 63 (10), 1872–1897.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI Blog 1 (8), 9.
Ragesh, R., Sellamanickam, S., Iyer, A., Bairi, R., Lingam, V., 2021. HeteGCN: Heterogeneous graph convolutional networks for text classification. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 860–868.
Rousseau, F., Kiagias, E., Vazirgiannis, M., 2015. Text categorization as a graph classification problem. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1702–1712.
Shen, D., Wang, G., Wang, W., Min, M.R., Su, Q., Zhang, Y., et al., 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. arXiv preprint arXiv:1805.09843.
Vashishth, S., Yadati, N., Talukdar, P., 2020. Graph-based deep learning in natural language processing. In: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, pp. 371–372.
Wan, Y., Gao, Q., 2015. An ensemble sentiment classification system of Twitter data for airline services analysis. In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, pp. 1318–1325.
Wang, Y., Wang, C., Zhan, J., Ma, W., Jiang, Y., 2023. Text FCG: Fusing contextual information via graph learning for text classification. Expert Syst. Appl., 119658.
Wu, L., Chen, Y., Ji, H., Liu, B., 2021. Deep learning on graphs for natural language processing. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2651–2653.
Wu, L., Chen, Y., Shen, K., Guo, X., Gao, H., Li, S., Long, B., 2023. Graph neural networks for natural language processing: A survey. Found. Trends Mach. Learn. 16 (2), 119–328.
Wu, H., Liu, Y., Wang, J., 2020. Review of text classification methods on deep learning. Comput. Mater. Continua 63 (3), 1309.
Xie, Q., Huang, J., Du, P., Peng, M., Nie, J.Y., 2021. Inductive topic variational graph auto-encoder for text classification. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4218–4227.
Yang, Y., Miao, R., Wang, Y., Wang, X., 2022. Contrastive graph convolutional networks with adaptive augmentation for text classification. Inf. Process. Manag. 59 (4), 102946.
Yao, L., Mao, C., Luo, Y., 2019. Graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 7370–7377.
Zhang, Y., Yu, X., Cui, Z., Wu, S., Wen, Z., Wang, L., 2020. Every document owns its structure: Inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826.
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Sun, M., 2020. Graph neural networks: A review of methods and applications. AI Open 1, 57–81.