
Journal of King Saud University – Computer and Information Sciences 35 (2023) 101610


Hierarchical graph-based text classification framework with contextual node embedding and BERT-based dynamic fusion

Aytuğ Onan

İzmir Katip Çelebi University, Faculty of Engineering and Architecture, Department of Computer Engineering, 35620 İzmir, Turkey

Article info

Article history:
Received 6 May 2023
Revised 26 May 2023
Accepted 3 June 2023
Available online 13 June 2023

Keywords:
Text classification
Hierarchical graph
Pre-trained language models
Contextual embedding
Attention mechanism

Abstract

We propose a novel hierarchical graph-based text classification framework that leverages the power of contextual node embedding and BERT-based dynamic fusion to capture the complex relationships between the nodes in the hierarchical graph and generate a more accurate classification of text. The framework consists of seven stages: Linguistic Feature Extraction, Hierarchical Node Construction with Domain-Specific Knowledge, Contextual Node Embedding, Multi-Level Graph Learning, Dynamic Text Sequential Feature Interaction, Attention-Based Graph Learning, and Dynamic Fusion with BERT. The first stage, Linguistic Feature Extraction, extracts the linguistic features of the text, including part-of-speech tags, dependency parsing, and named entities. The second stage constructs a hierarchical graph based on the domain-specific knowledge, which is used to capture the relationships between nodes in the graph. The third stage, Contextual Node Embedding, generates a vector representation for each node in the hierarchical graph, which captures its local context information, linguistic features, and domain-specific knowledge. The fourth stage, Multi-Level Graph Learning, uses a graph convolutional neural network to learn the hierarchical structure of the graph and extract the features of the nodes in the graph. The fifth stage, Dynamic Text Sequential Feature Interaction, captures the sequential information of the text and generates dynamic features for each node. The sixth stage, Attention-Based Graph Learning, uses an attention mechanism to capture the important features of the nodes in the graph. Finally, the seventh stage, Dynamic Fusion with BERT, combines the output from the previous stages with the output from a pre-trained BERT model to obtain the final integrated vector representation of the text. This approach leverages the strengths of both the proposed framework and BERT, allowing for better performance on the classification task. The proposed framework was evaluated on several benchmark datasets and compared to state-of-the-art methods, demonstrating significant improvements in classification accuracy.

© 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Text classification is a fundamental task in natural language processing that involves categorizing text documents into predefined classes or categories (Korde and Mahender, 2012). This task has numerous practical applications, such as sentiment analysis, spam filtering, topic classification, and language identification (Aggarwal and Zhai, 2012). Efficient and accurate text classification can significantly improve the performance of downstream NLP tasks such as machine translation, information retrieval, and text summarization. For instance, in information retrieval, accurate classification can help retrieve relevant documents more effectively. In sentiment analysis, accurate classification can help businesses to monitor the sentiment of customers towards their products and services. In text summarization, accurate classification can help identify the main topics and themes in a collection of documents. Therefore, developing better text classification techniques can have a significant impact on various NLP applications (Minaee et al., 2021).

Deep learning has played a crucial role in improving the efficiency of text classification tasks in recent years (Li et al., 2020). Traditional approaches to text classification relied on hand-crafted features and shallow machine learning algorithms such as Naïve Bayes, decision trees, and support vector machines. These approaches were limited in their ability to capture the complex relationships between words and the context in which they appear (Wu et al., 2020). However, the development of deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has enabled the automatic learning of features directly from raw text data, which has greatly improved the accuracy of text classification (Otter et al., 2020). Despite the success of deep learning approaches in recent years, text classification still faces several challenges, including the ability to effectively capture the complex relationships between words and concepts in a given text, and the difficulty in incorporating domain-specific knowledge into the classification process (Malekzadeh et al., 2021). To address these challenges, researchers have proposed various techniques such as hierarchical classification, graph-based models, and contextualized embeddings (Wang et al., 2023).

Graph Neural Networks (GNNs) are a type of deep learning model that have gained popularity in recent years for their ability to effectively model graph-structured data, such as social networks, molecular structures, and text data represented as hierarchical graphs (Zhou et al., 2020). In the context of text classification, GNNs have been shown to outperform traditional sequential learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) by leveraging the complex relationships between the nodes in a hierarchical graph representation of the text (Wu et al., 2023). The key difference between GNNs and sequential learning models is that GNNs operate directly on the graph structure, while sequential models process the input text in a sequential manner. GNNs can capture the structural information of the graph and leverage it to better represent the text data, while sequential models may struggle to model complex relationships between nodes in the graph (Liu and Wu, 2022). GNNs achieve this by propagating information from neighboring nodes to update the node representations iteratively. This allows the model to capture the dependencies and interactions between nodes in the graph and improve the overall performance of text classification tasks (Wu et al., 2021). GNNs provide a powerful framework for modeling graph-structured data such as hierarchical graph representations of text data. By leveraging the relationships between nodes in the graph, GNNs can capture the complex interactions between different elements of the text and provide superior performance compared to traditional sequential learning models (Vashishth et al., 2020).

In recent years, pre-trained language models have made significant contributions to improving the performance of text classification tasks (Qiu et al., 2020). These models are trained on large amounts of unlabeled text data to learn general language representations, and can then be fine-tuned on a specific task, such as text classification. One of the most widely used pre-trained models is the Bidirectional Encoder Representations from Transformers (BERT), which has achieved state-of-the-art results on a variety of natural language processing tasks (Devlin et al., 2018). Other notable pre-trained models include the Generative Pre-trained Transformer 2 (GPT-2) (Radford et al., 2019), the Transformer-XL (Dai et al., 2019), and the Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018). Pre-trained models offer several advantages for text classification, including the ability to leverage large amounts of data to learn powerful language representations, the ability to transfer knowledge from one task to another, and the ability to reduce the amount of labeled data needed for training. These advantages have made pre-trained models a popular choice for text classification tasks, particularly for tasks with limited labeled data (Li et al., 2021; Qiu et al., 2020).

Text classification using graph structures can be divided into two phases: constructing the graph structure and learning from it (Yao et al., 2019). In the construction phase, a graph is built based on the document collection using various techniques such as co-occurrence, dependency parsing, or knowledge graphs. In the learning phase, the graph is used to extract features and classify the documents (Yao et al., 2019; Wang et al., 2023). Graph-based approaches offer several advantages over traditional methods, such as the ability to capture semantic relationships between words and exploit the structural properties of the data (Ragesh et al., 2021). However, there are still several obstacles to improving the effectiveness of graph-based approaches for text classification. One major challenge is the difficulty of capturing long-range interactions between words, which can be critical for accurately classifying documents. Another challenge is the sparsity of the graph, which can lead to poor performance when using standard graph-based algorithms (Piao et al., 2022).

Graph-based approaches for text classification typically construct text graphs by representing the text as a set of nodes connected by edges that capture various relationships between the nodes. However, constructing these graphs is not a trivial task, and current approaches often rely on heuristics and domain-specific knowledge to determine the appropriate structure of the graph. This can limit the effectiveness of the resulting classification model, as important relationships between words and concepts may not be captured by the graph (Koncel-Kedziorski et al., 2019). Another challenge in graph-based text classification is how to effectively combine different sources of contextual information, such as syntactic and semantic features, to improve classification performance. Researchers have attempted to address this challenge by using deep learning models that can learn to automatically extract and combine these features in a data-driven manner. For example, recent studies have explored the use of graph convolutional networks (GCNs), which can capture both the local and global structures of the text graph to better incorporate different sources of contextual information. Additionally, pre-trained language models such as BERT have been used to encode contextual information for each node in the graph, allowing for better representation learning and classification performance (Wang et al., 2023). Despite these advancements, there are still challenges to be addressed in graph-based text classification, such as dealing with noisy and incomplete data, and designing effective graph structures that capture relevant semantic relationships between words and concepts. The proposed framework for hierarchical graph-based text classification with contextual node embedding and BERT-based dynamic fusion is a novel approach that overcomes key limitations of existing models. The main contributions of the study include:

• The integration of linguistic features, domain-specific knowledge, and contextual node embeddings into a hierarchical graph structure using a pre-trained language model (BERT).
• The incorporation of multi-level graph learning and attention-based graph learning to capture both the local and global relationships between nodes in the text graph.
• The dynamic fusion of the outputs from the previous stages with the output from a pre-trained language model to obtain a final integrated vector representation of the text, allowing for better performance on the classification task.

The manuscript is organized as follows: Section 2 presents a literature review on the related works in the field of text classification, graph neural networks, and pre-trained language models. Section 3 presents the proposed hierarchical graph-based text classification framework with contextual node embedding and BERT-based dynamic fusion, describing each individual stage in detail. Section 4 presents the experimental setup, including the datasets, evaluation metrics, and experimental results. Finally, Section 5 provides a conclusion and highlights the contributions of the study.
2. Related work

The success of deep learning approaches in text classification has been widely acknowledged, with significant progress achieved in recent years. However, the current methods still have limitations in capturing complex relationships between words and concepts in a given text. To overcome these limitations, recent research has focused on incorporating graph-based structures, pre-trained language models, and attention mechanisms. Several studies have highlighted the importance of word embeddings in deep learning-based text classification (Chen, 2015). These embeddings are representation models that capture information about the semantic and syntactic structure of words or phrases (Niu et al., 2021). Deep neural networks, such as CNN and RNN, have been widely used for text classification, either individually or in combination (Zhang et al., 2020). The attention mechanism has also been integrated into these models to improve their expressiveness (Hu, 2020). However, these methods primarily focus on local features and often fail to capture global contextual information or long-term dependencies (Liu et al., 2020).

Graph-based approaches have recently emerged as a promising alternative to overcome the limitations of traditional deep learning-based methods. These approaches use graph structures to model the relationships between words and concepts in a text. Graph Neural Networks (GNNs) are a popular technique used for graph-based text classification, which has shown promising results in capturing long-range interactions between words and effectively incorporating domain-specific knowledge. Rousseau et al. (2015) presented a graph-based approach to text categorization, treating it as a graph classification problem. Instead of the traditional bag-of-words representation, the study uses a graph-of-words representation, which enables the extraction of more discriminative features that capture long-distance n-grams. The proposed method leverages frequent subgraph mining to extract these features, and makes use of the k-core concept to reduce the graph representation to its densest part, improving the efficiency of feature extraction with little to no impact on prediction performance. TextGCN (Yao et al., 2019) is a recent graph-based approach for text classification that has shown promising results. It uses a standard graph convolutional network (GCN) on a single large graph constructed from the entire corpus to capture the contextual information. Each word in a document is considered as a node in the graph, and relationships between words are used to extract features. TextGCN uses a pre-trained word embedding model to initialize the word vectors in the graph, and then applies GCN layers to capture the graph structure and aggregate information from neighboring nodes. Additionally, TextGCN employs a hierarchical pooling technique to reduce the dimensionality of the graph and aggregate node features. Experimental results on several benchmark datasets demonstrate that TextGCN outperforms several state-of-the-art methods for text classification. This approach is particularly effective in capturing long-range dependencies between words and utilizes contextual information effectively. Similarly, TensorGCN (Liu et al., 2020) is another graph-based approach for text classification that uses a graph tensor to capture the relationships between words in a document. In this method, each word is treated as a node in the graph tensor, and the edges between nodes represent the co-occurrence of words in the same document. TensorGCN utilizes intra- and inter-graph propagation learning to incorporate more contextual information from the graph tensor. It applies GCN layers to the graph tensor to capture the graph structure and aggregate information from neighboring nodes. Additionally, TensorGCN employs a multi-level pooling technique to aggregate node features and reduce the dimensionality of the graph tensor.

The effectiveness of GNNs in practice is limited by their inability to capture high-order interactions between words and their inefficiency in handling large datasets and new documents. To address these issues, an architecture named hypergraph attention networks (HyperGAT) has been proposed (Ding et al., 2020). HyperGAT aims to provide more expressive power with less computational consumption for text representation learning. In another study, Zhang et al. (2020) present a novel architecture, referred to as TextING, that utilizes graph neural networks (GNNs) to generate word representations in an inductive manner. Unlike previous graph-based approaches that utilize global structures, TextING trains a GNN to capture detailed word-word relations within individual documents, which can then be generalized to new documents during testing. The architecture builds individual graphs using a sliding window approach within each document and uses gated graph neural networks to propagate information between word nodes and aggregate the document embedding. The study demonstrates that TextING outperforms baseline approaches, even in situations where words in the test data are mostly unseen. Existing graph-based models often construct text graphs by rules that introduce massive noise and cannot sufficiently exploit labeled and unlabeled node information. To address these issues, some researchers have introduced contrastive learning to the graph domain to better utilize node information. In this context, Yang et al. (2022) proposed a new graph-based model for text classification called CGA2TC, which introduces contrastive learning with an adaptive augmentation strategy to obtain more robust node representations. Specifically, a text graph is constructed by exploring word co-occurrence and document-word relationships, and an adaptive augmentation strategy is designed to generate two contrastive views that highlight relatively important edges while solving the noise problem. In another study, Piao et al. (2022) presented an architecture named TextSSL, a graph-based sparse structure learning model that addresses the challenges of word ambiguity, word synonymity, and dynamic contextual dependency for inductive document classification by sparsely selecting edges with dynamic contextual dependencies to jointly exploit local and global contextual information in documents. To handle documents containing new words and relations, Wang et al. (2023) proposed the TextFCG method, which constructs a single graph for all words in each text, labeled by fusing various contextual relations to enhance GNN learning. The model interacts local words with global text information using GNN and GRU, and focuses on contextual features from the text itself.
Compared to existing graph-based methods, such as TextGCN and TensorGCN, which typically use a single large graph to represent the relationships between words in the corpus, our proposed framework takes a novel approach by constructing a hierarchical graph based on domain-specific knowledge. This hierarchical graph allows the model to capture relationships between nodes at different levels of abstraction, leading to a more nuanced understanding of the text data. By leveraging domain-specific knowledge, our model incorporates prior information and domain expertise into the graph construction process. This enhances the model's ability to capture meaningful relationships between concepts and improves its interpretability. The hierarchical structure enables the model to capture both local and global dependencies, as well as hierarchical relationships between words, resulting in a more comprehensive representation of the text data. In addition, our framework incorporates contextual node embedding, which takes into account the contextual information of words. By utilizing pre-trained language models, such as BERT, to generate contextualized word representations, our model can capture the complex and context-dependent relationships between words. This enables the model to capture fine-grained nuances and semantic associations, leading to more accurate text classification results. Furthermore, our model employs BERT-based dynamic fusion, which dynamically combines the contextual node embeddings with BERT representations. This fusion process allows the model to leverage the power of both the hierarchical graph structure and the rich contextual information provided by BERT. By integrating these two sources of information, our model can effectively capture the complex relationships between nodes and generate predictions that are more accurate.

3. Proposed text classification framework

The proposed Dynamic Graph-based Text Analysis Framework using Attention Mechanisms is a powerful technique for text classification tasks that involve complex relationships between nodes in a hierarchical graph.

The general structure of the proposed scheme has been outlined in Fig. 1. The framework consists of several stages, including Linguistic Feature Extraction, Hierarchical Node Construction with Domain-specific Knowledge, Contextual Node Embedding, Multi-level Graph Learning, Attention-based Graph Learning, and Dynamic Fusion with BERT. The framework leverages pre-trained language models such as BERT to encode contextual information, linguistic features, and domain-specific knowledge into vector representations for each node in the graph. The attention-based graph learning stage allows the model to capture the complex relationships between nodes, while the multi-level graph learning stage enables the model to learn representations at different levels of abstraction. The dynamic fusion stage combines the outputs from the previous stages with the output from an external model using a learning framework, which improves the overall performance of the model.

Fig. 1. The general structure of the proposed text classification framework.
The individual components of our proposed approach in the text classification framework are carefully justified to address specific challenges in the field. The utilization of hierarchical nodes aims to capture the hierarchical relationships and semantic structure inherent in text data, allowing for a more comprehensive understanding of the textual content. By constructing these hierarchical nodes, we can effectively represent the nested relationships between concepts, enabling a more fine-grained analysis of the text data. To capture the contextual information and semantic meaning of words within the hierarchical nodes, we employ contextual node embedding. This approach utilizes pre-trained language models such as BERT or GPT to encode the contextual information of words, considering their context-dependent nature. By leveraging these contextual embeddings, we can better grasp the nuanced meanings and semantic relationships between words, enhancing the model's ability to comprehend the text data. Furthermore, our framework incorporates dynamic fusion with BERT to capitalize on the powerful contextual representations offered by this pre-trained language model. By dynamically fusing the contextual node embeddings with BERT representations, our model benefits from both the hierarchical structure captured by the nodes and the rich contextual information captured by BERT. This integration leads to a more comprehensive understanding of the text data and improves the model's discriminative power. In addition, attention-based graph learning is employed to capture the relationships and dependencies between different nodes within the hierarchical graph. By incorporating attention mechanisms, the model assigns importance weights to different nodes and learns to focus on the most relevant information during classification. This attention-based graph learning enables the model to capture the intricate dependencies between concepts within the hierarchical structure, enhancing its overall discriminative capability. In summary, each component of our proposed approach has been carefully justified based on its ability to address specific challenges in text classification. The hierarchical node construction captures the hierarchical nature of text data, contextual node embedding captures contextual information, dynamic fusion with BERT leverages pre-trained language models, and attention-based graph learning captures relationships between nodes. By synergistically integrating these components, our approach improves the accuracy, interpretability, and overall performance of the text classification framework.

3.1. Linguistic feature extraction

Linguistic Feature Extraction is a stage in the text classification process that involves extracting various linguistic features from the preprocessed text data. These features can help to capture important information about the structure and meaning of the text, which can be used to improve the accuracy of text classification models. Here are the details on this stage (a brief implementation sketch follows the list):

• Part-of-Speech (POS) Tagging: Part-of-Speech (POS) tagging is a process of assigning a part-of-speech tag to each word in the text. This involves analyzing the syntactic context of each word to determine its grammatical category, such as noun, verb, adjective, etc. This can be formalized as follows: For each document $t_i$, perform POS tagging to obtain a sequence of part-of-speech tags $p_i = \{p_1, p_2, \ldots, p_{n_i}\}$, where $p_j$ represents the part-of-speech tag of the j-th word in the document. The POS tagging process can be implemented using various techniques, such as rule-based systems or machine learning models.
• Dependency Parsing: Dependency parsing is a process of identifying the relationships between words in the text, such as subject-verb or noun-adjective relationships. This involves constructing a dependency parse tree for each sentence in the document, where each node represents a word in the sentence, and each edge represents a dependency relationship between words. This can be formalized as follows: For each document $t_i$, perform dependency parsing to obtain a dependency parse tree $D_i = (V, E)$, where $V$ is the set of nodes representing the words in the document, and $E$ is the set of edges representing the dependency relationships between the words. The dependency parsing process can be implemented using various algorithms, such as the Stanford Parser or the spaCy library.
• Named Entity Recognition (NER): Named Entity Recognition (NER) is a process of identifying and classifying named entities in the text, such as people, organizations, and locations. This involves detecting the presence of named entities in the text and assigning them to predefined categories. This can be formalized as follows: For each document $t_i$, perform NER to obtain a set of named entities $N_i = \{n_1, n_2, \ldots, n_{m_i}\}$, where $n_j$ represents the j-th named entity in the document, along with its corresponding type. The NER process can be implemented using various techniques, such as rule-based systems or machine learning models.
• Coreference Resolution: Coreference resolution is a process of identifying when two or more words in the text refer to the same entity, such as when "he" refers to a previously mentioned person. This involves grouping together all the words that refer to the same entity and assigning them a single label. This can be formalized as follows: For each document $t_i$, perform coreference resolution to identify when two or more words in the document refer to the same entity, and group them together into a single entity. The coreference resolution process can be implemented using various algorithms, such as the Stanford CoreNLP library.
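The sketch below illustrates this stage with spaCy, one of the libraries mentioned above. The pipeline name, the function name, and the decision to omit coreference resolution (for which the text suggests Stanford CoreNLP) are illustrative assumptions rather than details of the paper's actual implementation.

# Minimal sketch of Stage 3.1 using spaCy (one of the candidate tools named above).
# Coreference resolution is omitted; Stanford CoreNLP is suggested for it in the text.
# Requires the model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline: tagger, parser, NER

def extract_linguistic_features(document: str) -> dict:
    """Return POS tags, dependency arcs, and named entities for one document."""
    doc = nlp(document)
    return {
        # p_i = {p_1, ..., p_{n_i}}: one POS tag per token
        "pos_tags": [(tok.text, tok.pos_) for tok in doc],
        # D_i = (V, E): dependency arcs as (head, relation, dependent) triples
        "dependencies": [(tok.head.text, tok.dep_, tok.text) for tok in doc],
        # N_i = {n_1, ..., n_{m_i}}: named entities with their types
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }

if __name__ == "__main__":
    features = extract_linguistic_features(
        "The company announced its latest earnings report yesterday."
    )
    for key, values in features.items():
        print(key, values)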

3.2. Hierarchical node construction with domain-specific knowledge

In this stage, the text data is represented as a hierarchical graph, where each document is a node, and each sentence within the document is a sub-node. In addition to this, a domain-specific knowledge graph (e.g., WordNet) is incorporated to capture the domain-specific relationships and semantic meanings between the words in the text. Here are more details on this stage:

1. Hierarchical Graph Construction: The first step is to construct a hierarchical graph representation of the text data. In this graph, each document is represented as a node, and each sentence within the document is a sub-node. This can be formalized as follows: Let $T = \{t_1, t_2, \ldots, t_n\}$ be a set of preprocessed text data, where $t_i$ represents the i-th document in the set. For each document $t_i$, construct a hierarchical graph $G_i = (N_i, E_i)$, where $N_i$ is the set of nodes representing the document and its sentences, and $E_i$ is the set of edges representing the hierarchical relationships between the nodes. This can be formalized as: $N_i = \{n_1, n_2, \ldots, n_{m_i}\}$, where $n_1$ represents the document node and $n_2, \ldots, n_{m_i}$ represent the sentence nodes within the document, and $E_i = \{(n_1, n_2), (n_1, n_3), \ldots, (n_1, n_{m_i})\}$, representing the hierarchical relationships between the document and its sentences.

2. Domain-specific Knowledge Incorporation: The second step is to incorporate domain-specific knowledge graphs (e.g., WordNet or medical ontologies) to capture the domain-specific relationships and semantic meanings between the words in the text. This can be formalized as follows: For each document $t_i$, incorporate a domain-specific knowledge graph $K_i = (V_i, E_{k_i})$, where $V_i$ is the set of nodes representing the words in the text, and $E_{k_i}$ is the set of edges representing the domain-specific relationships between the words. This can be formalized as: $V_i = \{v_1, v_2, \ldots, v_{k_i}\}$, where $v_j$ represents the j-th word in the document, and $(v_k, v_l) \in E_{k_i}$ if there is a domain-specific relationship between words $v_k$ and $v_l$. The domain-specific knowledge graphs can be constructed using various techniques, such as ontology-based methods or distributional semantics. By incorporating domain-specific knowledge in the hierarchical node construction stage, the model becomes more tailored to the specific domain, capturing its unique characteristics. This improves the accuracy of the classification process by enabling the model to focus on the most relevant aspects of the text data. Furthermore, the utilization of domain-specific knowledge enhances the interpretability and explainability of the model. The hierarchical nodes derived from domain-specific knowledge provide a more intuitive representation of the underlying concepts and relationships in the domain. This allows users to understand how the model is making predictions and provides insights into the factors influencing the classification decisions. In summary, the utilization of domain-specific knowledge in the hierarchical node construction stage enhances the accuracy, interpretability, and relevance of the model for the specific domain. By incorporating domain-specific terminologies, taxonomies, expert knowledge, and external resources, the model can capture the unique characteristics of the domain and improve the overall performance of the text classification framework.

3. Hierarchical Graph and Knowledge Graph Fusion: The final step is to fuse the hierarchical graph and the domain-specific knowledge graph to obtain an integrated graph representation of the text data. For each document $t_i$, fuse the hierarchical graph $G_i = (N_i, E_i)$ and the domain-specific knowledge graph $K_i = (V_i, E_{k_i})$ to obtain an integrated graph $G'_i = (N'_i, E'_i)$, where $N'_i$ is the set of nodes representing the document and its sentences, as well as the words in the text, and $E'_i$ is the set of edges representing the hierarchical relationships between the nodes and the domain-specific relationships between the words. This can be formalized as: $N'_i = N_i \cup V_i$, where $V_i$ represents the nodes in the domain-specific knowledge graph, and $E'_i = E_i \cup E_{k_i}$, where $E_{k_i}$ represents the edges in the domain-specific knowledge graph. The fusion process can be performed using various techniques, such as graph convolutional networks or attention mechanisms. For example, graph convolutional networks can be used to learn node embeddings that capture both the hierarchical and domain-specific relationships in the graph. Attention mechanisms can be used to weigh the importance of different nodes in the graph based on their relevance to the classification task.

Representing the text data as a hierarchical graph is an efficient approach to text representation, as it captures the hierarchical relationships between the different components of the text (i.e., documents and sentences). This allows the model to better understand the structure of the text data, which can improve its performance on text classification tasks. Incorporating domain-specific knowledge graphs into the text representation enables the model to capture the domain-specific relationships and semantic meanings between words in the text. This is important because the meaning of words can vary depending on the domain or context in which they are used. By incorporating domain-specific knowledge graphs, the model can better understand the meaning of words in the context of the domain being analyzed, which can improve its accuracy on text classification tasks. Fusing the hierarchical graph and the domain-specific knowledge graph into an integrated graph representation enables the model to capture both the hierarchical and domain-specific relationships in the text data. This allows the model to learn more informative node embeddings that capture both the structural and semantic information in the text data. Additionally, the fusion process can be performed using various techniques, such as graph convolutional networks or attention mechanisms, which allows for more flexibility and customization in the model architecture. The Hierarchical Node Construction with Domain-specific Knowledge stage is a novel approach to text representation that leverages the structure and semantic meaning of the text data to improve text classification accuracy. By incorporating hierarchical and domain-specific knowledge into the graph representation, the model is better able to understand the complex relationships between the different components of the text and the meaning of the words in the context of the domain being analyzed. In Fig. 2, the hierarchical, domain-specific knowledge, and integrated graphs generated for a sample text data set consisting of three documents, each with three sentences, are illustrated; a small construction sketch follows the figure caption.

Fig. 2. The sample hierarchical, domain-specific and integrated graphs generated in Stage 3.2.
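As a concrete illustration of the three steps above, the following sketch builds $G_i$, $K_i$, and their fusion $G'_i$ with networkx, using WordNet (via NLTK) as the knowledge source. The helper names and the synset-overlap rule for domain edges are illustrative assumptions; a specialised ontology could replace WordNet in a domain-specific setting.

# Illustrative sketch of Stage 3.2: build G_i, K_i, and fuse them into G'_i.
# Uses networkx for the graphs and WordNet (via NLTK) as the knowledge source;
# requires: nltk.download("wordnet") before the first run.
import itertools
import networkx as nx
from nltk.corpus import wordnet as wn

def build_hierarchical_graph(doc_id: str, sentences: list[str]) -> nx.Graph:
    """G_i: a document node n_1 connected to one sub-node per sentence."""
    G = nx.Graph()
    G.add_node(doc_id, kind="document")
    for k, sent in enumerate(sentences, start=1):
        sent_id = f"{doc_id}.s{k}"
        G.add_node(sent_id, kind="sentence", text=sent)
        G.add_edge(doc_id, sent_id, kind="hierarchical")
    return G

def build_knowledge_graph(words: set[str]) -> nx.Graph:
    """K_i: word nodes linked when they share at least one WordNet synset."""
    K = nx.Graph()
    K.add_nodes_from(words, kind="word")
    for w1, w2 in itertools.combinations(words, 2):
        if set(wn.synsets(w1)) & set(wn.synsets(w2)):
            K.add_edge(w1, w2, kind="domain")
    return K

def fuse(G: nx.Graph, K: nx.Graph) -> nx.Graph:
    """G'_i = (N_i ∪ V_i, E_i ∪ E_{k_i}): union of the two graphs."""
    return nx.compose(G, K)

if __name__ == "__main__":
    sentences = ["The company announced its earnings.", "Profits rose sharply."]
    words = {"company", "earnings", "profits"}
    integrated = fuse(build_hierarchical_graph("d1", sentences),
                      build_knowledge_graph(words))
    print(integrated.number_of_nodes(), integrated.number_of_edges())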

3.3. Contextual node embedding

Contextual node embedding is the process of representing each node in the hierarchical graph as a vector that encodes its local context information, linguistic features, and domain-specific knowledge. The goal of contextual node embedding is to capture the complex relationships between the nodes in the graph, as well as the broader context in which each node appears. This is achieved using a pre-trained language model, i.e., BERT, which is trained on large amounts of text data to learn how to encode the contextual information of the text. The language model generates a vector representation, also known as an embedding, for each word in the text that captures its meaning and context. The same pre-trained language model can be used to generate embeddings for the nodes in the hierarchical graph. To generate the embeddings for each node in the hierarchical graph, the text associated with each node is first concatenated with the text of its neighboring nodes. This concatenated text is then fed into the language model, which generates a vector representation for each node based on its context in the concatenated text. In addition to capturing the local context information of each node, the language model can also incorporate linguistic features, such as part-of-speech tags, dependency parsing, and named entities, by inputting the text along with these features. This allows the language model to learn how to encode the linguistic properties of the text, which can be useful for downstream tasks such as text classification. Finally, the language model can also incorporate domain-specific knowledge by using knowledge graphs such as WordNet or medical ontologies to provide additional context for each node. By incorporating domain-specific knowledge into the language model, it can learn to encode the semantic relationships between words and concepts in the domain, which can be useful for tasks such as entity recognition and relation extraction. The contextual node embedding is a powerful technique for representing the nodes in a hierarchical graph as vectors that capture their local context, linguistic features, and domain-specific knowledge. By using a pre-trained language model, we can take advantage of the large amounts of text data that are available and learn to encode complex relationships between nodes in the graph.

Let $G = (V, E)$ be the hierarchical graph with node set $V$ and edge set $E$. Each node $v \in V$ represents a piece of text, such as a document or a sentence, and has associated text features $x_v$ that describe its local context, linguistic properties, and domain-specific knowledge. Let $f(x_v; \theta)$ be a pre-trained language model with parameters $\theta$ that takes as input the text features $x_v$ associated with each node $v$ and generates a contextualized vector representation $h_v \in \mathbb{R}^d$ for each node. The contextualized representation $h_v$ encodes the local context, linguistic properties, and domain-specific knowledge of the node, and is generated by applying a non-linear function $g$ to the output of the language model:

$h_v = g(f(x_v; \theta))$   (1)

where $g$ is a non-linear function, such as a rectified linear unit (ReLU) or a sigmoid function. To generate the contextualized vector representation $h_v$ for each node $v$, the text features $x_v$ associated with the node are first concatenated with the text features $x_u$ associated with each neighboring node $u \in N(v)$, where $N(v)$ is the set of neighboring nodes of $v$ in the graph. The concatenated text features are then input to the language model $f$, which generates the contextualized vector representation $h_v$ for the node. Formally, the input to the language model for node $v$ is defined as:

$x_v = \big[ x_v;\ \sum_{u \in N(v)} x_u \big]$   (2)

where $[\cdot;\cdot]$ denotes concatenation and $\sum$ denotes summation. This input concatenates the text features of node $v$ with the sum of the text features of its neighboring nodes, capturing the local context of the node in the graph.

The output of the language model $f$ is a sequence of d-dimensional vectors $\{h_1, h_2, \ldots, h_n\}$, where $n$ is the length of the input sequence. The contextualized vector representation $h_v$ for node $v$ is defined as the corresponding vector in the output sequence:

$h_v = h_i$   (3)

where $i$ is the index of the token in the input sequence that corresponds to node $v$. By using a pre-trained language model to generate the contextualized vector representation for each node in the hierarchical graph, we can capture the complex relationships between the nodes, as well as the broader context in which each node appears. This representation can be used as input to downstream tasks such as text classification, entity recognition, and relation extraction, improving the performance of these tasks.
Suppose we have a hierarchical graph that represents a collection of news articles. The graph consists of document nodes, which represent the articles, and sentence nodes, which represent the sentences within the articles. Each node has associated text features that describe its local context, such as the words in the sentence and their part-of-speech tags, as well as domain-specific knowledge, such as the named entities mentioned in the text. To generate the contextualized vector representation for each node in the graph, we use a pre-trained language model such as BERT. We first concatenate the text features of each node with the text features of its neighboring nodes, capturing the local context of the node in the graph. We then input this concatenated text to the language model, which generates a contextualized vector representation for each node. For example, consider the sentence node "The company announced its latest earnings report yesterday." This sentence node has associated text features that describe the words in the sentence ("the", "company", "announced", "its", "latest", "earnings", "report", "yesterday") and their part-of-speech tags ("DT", "NN", "VBD", "PRP$", "JJS", "NNS", "NN", "NN"). It also has associated domain-specific knowledge, such as the named entity "company". To generate the contextualized vector representation for this sentence node, we concatenate its text features with the text features of its neighboring sentence nodes and input the resulting sequence to the language model. The language model generates a d-dimensional vector representation that captures the meaning and context of the sentence node. This process is repeated for all nodes in the hierarchical graph, generating a contextualized vector representation for each node. These representations can be used as input to downstream tasks such as text classification, entity recognition, and relation extraction, improving the performance of these tasks by capturing the complex relationships between the nodes in the graph.
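The following sketch shows how the embedding step of this example could look with the Hugging Face Transformers library. Using bert-base-uncased, taking the [CLS] vector as $h_v$, and space-joining the neighbour texts are illustrative choices, not details prescribed by the paper.

# Minimal sketch of Stage 3.3: encode [node text; neighbour texts] with BERT and
# take the [CLS] vector as h_v (cf. Eqs. (2)-(3)). Pooling and neighbour ordering
# are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def contextual_node_embedding(node_text: str, neighbour_texts: list[str]) -> torch.Tensor:
    """Return h_v for one node of the hierarchical graph."""
    combined = " ".join([node_text] + neighbour_texts)      # node text plus its neighbours
    inputs = tokenizer(combined, return_tensors="pt", truncation=True, max_length=512)
    outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)    # [CLS] token used as h_v

if __name__ == "__main__":
    h_v = contextual_node_embedding(
        "The company announced its latest earnings report yesterday.",
        ["Analysts expected weaker results.", "Shares rose after the announcement."],
    )
    print(h_v.shape)  # torch.Size([768]) for bert-base-uncased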
3.4. Multi-level graph learning

The Multi-level Graph Learning stage is a critical component of the text classification framework that enables the model to learn the representations of the nodes in the hierarchical graph. This stage employs a combination of Graph Neural Networks (GNNs) and attention mechanisms to capture the contextual information from neighboring nodes, linguistic features, and domain-specific knowledge, and to enhance the representation learning. The graph is learned at multiple levels of granularity, which allows the model to capture the different levels of information in the text data. For instance, at the word level, the model can learn the relationships between individual words and their neighbors within the context of the sentence. At the sentence level, the model can learn the relationships between sentences within a document, and at the document level, the model can learn the relationships between different documents in the corpus. The multi-level graph learning scheme consists of the following key steps (a minimal layer sketch follows the list):
• Node Initialization: Each node in the hierarchical graph is initialized with its contextualized representation obtained from the Contextual Node Embedding stage.
• Neighborhood Aggregation: For each node in the graph, its neighboring nodes are identified based on a pre-defined threshold distance in the graph. The contextualized representations of the neighboring nodes are then aggregated using a Graph Neural Network (GNN) to capture the contextual information from neighboring nodes. The goal of this step is to aggregate the representations of the neighboring nodes for each node $i$ in the graph. Let $G = (V, E)$ be the hierarchical graph representing the text data, where $V$ is the set of nodes in the graph (including documents, sentences, and words), and $E$ is the set of edges between the nodes. Let $h_i^{(0)}$ be the initial representation of node $i$ in the graph, obtained from the Contextual Node Embedding stage. Let $N_i$ be the set of neighboring nodes of node $i$, defined as:

$N_i = \{\, j \mid (i, j) \in E \text{ and } d(i, j) \le r \,\}$   (4)

where $d(i, j)$ is the distance between nodes $i$ and $j$ in the graph, and $r$ is a pre-defined threshold distance. The aggregated representation for node $i$ is then obtained by applying a Graph Neural Network (GNN) to the set of neighboring nodes, as follows:

$h_i^{(l+1)} = f\big( h_i^{(l)},\ \operatorname{aggregate}_{j \in N_i}\, g(h_j^{(l)}) \big)$   (5)

where $l$ is the current level of the graph learning, $f$ is a non-linear activation function, $g$ is a transformation function that maps the input representation to a new representation, and aggregate is a permutation-invariant function that aggregates the representations of the neighboring nodes. One common choice for aggregate is the max pooling function.
• Linguistic Feature Integration: In addition to the contextualized representations, the linguistic features (such as part-of-speech tags, dependency parsing, named entities) and domain-specific knowledge are also integrated into the node representations using attention mechanisms. Let $X$ be the matrix of linguistic features and domain-specific knowledge for all nodes in the graph. The attention mechanism is defined as:

$a_i = \operatorname{softmax}\big( W_a [h_i^{(l)}; x_i] \big)$   (6)

where $[h_i^{(l)}; x_i]$ is the concatenation of the current representation and the corresponding row of $X$, and $W_a$ is a learnable weight matrix. The final representation for node $i$ is then obtained by combining the current representation with the attention-weighted sum of its neighbors, as follows:

$h_i^{(l+1)} = \big[ h_i^{(l+1)};\ \sum_{j \in N_i} a_{ij}\, h_j^{(l+1)} \big]$   (7)

where $a_{ij}$ is the attention weight assigned to node $j$ by node $i$.
• Multi-level Graph Learning: The graph is learned at multiple levels of granularity (word, sentence, and document levels) using a combination of GNNs and attention mechanisms. This allows the model to capture the different levels of information in the text data and to learn representations that are optimized for each level of granularity.
• Output Prediction: The final prediction is obtained by applying a classification layer to the learned representations of the nodes in the hierarchical graph, as follows:

$y = \operatorname{softmax}\big( W_y h_i^{(l)} \big)$   (8)

where $l$ is the final level of the graph learning, $W_y$ is a learnable weight matrix, and softmax is the function that normalizes the output into a probability distribution over the class labels.

The Multi-level Graph Learning stage enables the model to learn representations of the nodes in the hierarchical graph that capture the complex relationships between the nodes in the graph, and to use this information to improve the accuracy of text classification tasks. By incorporating both the contextualized representations and the linguistic features and domain-specific knowledge, the model can capture a wide range of information in the text data, and by learning the graph at multiple levels of granularity, the model can capture the different levels of information in the text data.
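To make the aggregation and feature-integration steps concrete, the sketch below implements one layer of this scheme in PyTorch under simplifying assumptions: a dense 0/1 adjacency matrix with self-loops built from the threshold rule of Eq. (4), max pooling as the aggregator in Eq. (5), ReLU as the non-linearity, and a single-score attention head as one possible reading of Eqs. (6)-(7). The class name and all dimensions are illustrative.

# One round of neighbourhood aggregation and linguistic feature integration.
# Assumes a dense adjacency matrix with self-loops so every row has a neighbour.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelGraphLayer(nn.Module):
    def __init__(self, hidden_dim: int, feat_dim: int):
        super().__init__()
        self.transform = nn.Linear(hidden_dim, hidden_dim)       # g(.) in Eq. (5)
        self.update = nn.Linear(2 * hidden_dim, hidden_dim)      # f(.) in Eq. (5)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)          # W_a in Eq. (6)

    def forward(self, h, x, adj):
        # h: (N, d) node states, x: (N, f) linguistic/domain features,
        # adj: (N, N) adjacency built from the threshold rule of Eq. (4).
        g = self.transform(h)
        expanded = g.unsqueeze(0).expand(h.size(0), -1, -1).clone()   # (N, N, d)
        expanded[adj == 0] = float("-inf")                            # drop non-neighbours
        pooled = expanded.max(dim=1).values                           # max-pool aggregation
        h_new = F.relu(self.update(torch.cat([h, pooled], dim=-1)))   # Eq. (5)

        scores = self.attn(torch.cat([h_new, x], dim=-1)).squeeze(-1) # per-node score, Eq. (6)
        scores = scores.unsqueeze(0).expand(h.size(0), -1).clone()
        scores[adj == 0] = float("-inf")
        alpha = torch.softmax(scores, dim=-1)                         # a_{ij} over neighbours
        context = alpha @ h_new                                       # sum_j a_{ij} h_j
        return torch.cat([h_new, context], dim=-1)                    # Eq. (7): output (N, 2d)

if __name__ == "__main__":
    N, d, f = 5, 16, 8
    adj = (torch.rand(N, N) > 0.5).float()
    adj.fill_diagonal_(1.0)                                           # self-loops
    layer = MultiLevelGraphLayer(d, f)
    out = layer(torch.randn(N, d), torch.randn(N, f), adj)
    print(out.shape)                                                  # torch.Size([5, 32])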
3.5. Dynamic text sequential feature interaction

The Dynamic Text Sequential Feature Interaction stage is used to capture the sequential information of the entire document, which is important for many text classification tasks such as sentiment analysis or predicting the next word in a sentence. This stage uses a Dynamic Time Warping (DTW) algorithm to align the sequence of word embeddings, which allows for the identification of important temporal relationships between the words in the text. The DTW algorithm is used to align the sequence of word embeddings to identify the optimal path through the sequence of embeddings. The optimal path represents the alignment that minimizes the distance between the two sequences. In the context of text classification, the two sequences are the sequence of word embeddings for a given document and a reference sequence of embeddings (e.g., the average embeddings for a particular category). The general structure of the Dynamic Time Warping (DTW) algorithm used in the Dynamic Text Sequential Feature Interaction stage has been presented in Algorithm 1.

Algorithm 1. The general structure of the Dynamic Time Warping (DTW) algorithm.

  Inputs: sequence X of length n, sequence Y of length m
  Output: DTW distance between X and Y
  1. Create a distance matrix D of size (n + 1) x (m + 1) and initialize all values to infinity.
  2. Set D[0, 0] = 0.
  3. For i in 1 to n, and j in 1 to m:
     a. Compute the distance between X[i] and Y[j]: dist = distance(X[i], Y[j])
     b. Update the distance matrix: D[i, j] = dist + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
  4. Return D[n, m]
In Algorithm 1, the distance function distance(X[i], Y[j]) is a function that computes the distance between two word embeddings X[i] and Y[j]. The distance function can be any function that computes the distance between two vectors, such as Euclidean distance or cosine similarity. The algorithm computes a distance matrix D, where D[i, j] represents the minimum distance between the first i elements of sequence X and the first j elements of sequence Y. The algorithm starts by initializing the distance between the first elements of both sequences to 0. Then, for each subsequent element in X and Y, it computes the distance between the two elements and updates the distance matrix D with the minimum distance among the three possible paths: going down from the previous row, going right from the previous column, or going diagonally from the previous diagonal element. Once the distance matrix D is computed, the DTW distance between the two sequences is simply the value in the bottom right corner of the matrix, which represents the minimum distance between the entire sequences. The DTW algorithm is dynamic because it allows the sequences to be aligned in a non-linear fashion, allowing for variations in the timing and duration of the events in the sequences. This makes it a useful tool for capturing the sequential information of the entire document, which is important for many text classification tasks.

Once the alignment is obtained, a feature interaction mechanism is used to capture the important temporal relationships between the words in the text. This is achieved by multiplying the aligned word embeddings together to create a new set of features that captures the interactions between neighboring words. The resulting feature vector can be used as input to a machine learning algorithm for text classification.
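A direct, runnable rendering of Algorithm 1 is given below in NumPy. The Euclidean element-wise distance and the randomly generated reference sequence are illustrative stand-ins for the embeddings described above.

# Runnable version of Algorithm 1 (DTW distance between two embedding sequences).
import numpy as np

def dtw_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """DTW distance between sequences X of shape (n, d) and Y of shape (m, d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)   # step 1: (n+1) x (m+1) matrix of infinities
    D[0, 0] = 0.0                         # step 2
    for i in range(1, n + 1):             # step 3
        for j in range(1, m + 1):
            dist = np.linalg.norm(X[i - 1] - Y[j - 1])                       # 3a: distance(X[i], Y[j])
            D[i, j] = dist + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])  # 3b: three possible paths
    return float(D[n, m])                 # step 4

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_embeddings = rng.normal(size=(12, 8))   # word embeddings of a document
    reference = rng.normal(size=(10, 8))        # e.g. mean embeddings of a category
    print(round(dtw_distance(doc_embeddings, reference), 3))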
3.6. Attention-based graph learning

The Attention-based Graph Learning stage is used to calculate attention weights for each node in the hierarchical graph and compute a weighted sum of the node embeddings to yield an integrated vector representation encoding the text. An attention mechanism is employed to better capture the important information from the nodes, allowing the model to focus on the most relevant nodes in the graph. The attention weights are calculated based on the similarity between the node embeddings and a query vector, which is typically learned during the training process. The similarity can be computed using any function that computes the similarity between two vectors, such as the dot product or cosine similarity. The attention weights are then used to compute a weighted sum of the node embeddings, where the weights serve as the weights for the sum. This results in a new representation of the text that captures the important information from the nodes, with higher weights given to nodes that are more relevant to the task at hand. The attention mechanism allows the model to focus on the most relevant nodes in the graph, which can vary depending on the specific text classification task. For example, in a sentiment analysis task, the model may need to focus more on the words that convey sentiment, while in a topic classification task, the model may need to focus more on the words that relate to the topic of the document.

Let $G = (V, E)$ be a hierarchical graph representing a document, where $V$ is the set of nodes and $E$ is the set of edges. Each node $i$ in $V$ represents a word in the document and is embedded with a contextualized representation $h_i$, which encodes its local context information, linguistic features, and domain-specific knowledge. We first compute a query vector $q$ that captures the task-specific information:

$q = f_{\text{task}}(T)$   (9)

where $f_{\text{task}}$ is a function that computes the query vector from the task-specific input $T$. The task-specific function is a neural network classifier that takes as input the integrated vector representation of the entire document and outputs the predicted class label. For example, suppose we have two classes (positive and negative) and a neural network classifier with weights $W_c$ and bias $b_c$. Then the task-specific function can be defined as:

$f_{\text{task}}(T) = \sigma(W_c T + b_c)$   (10)

where $\sigma$ is the sigmoid activation function. The output of this function is a scalar value between 0 and 1, representing the probability of the input document belonging to the positive class. Next, we compute the attention weights for each node $i$ in $V$ based on its embedding $h_i$ and the query vector $q$:

$e_i = g(h_i, q), \qquad a_i = \operatorname{softmax}(e_i)$   (11)

where $g$ is a function that computes the attention score between the node embedding $h_i$ and the query vector $q$, and softmax is the function that normalizes the attention scores to obtain a probability distribution over the nodes. We then compute a weighted sum of the node embeddings to obtain a new representation of the text:

$h_G = \sum_i a_i\, h_i$   (12)

where $h_G$ is the integrated vector representation encoding the text. The Attention-based Graph Learning stage uses an attention mechanism to compute attention weights for each node in the graph based on its embedding and a task-specific query vector. The attention weights are used to compute a weighted sum of the node embeddings, yielding an integrated vector representation that captures the important information from the nodes and allows the model to focus on the most relevant nodes in the graph.
similarity can be computed using any function that computes the tualized representation of the text. Finally, we concatenate the
similarity between two vectors, such as dot product or cosine sim- contextualized representation from BERT with the output from
ilarity. The attention weights are then used to compute a weighted the previous stages, and pass the concatenated vector through a
sum of the node embeddings, where the weights serve as the fully-connected layer to obtain the final class label. This approach
weights for the sum. This results in a new representation of the leverages the strengths of both the proposed framework and BERT,
text that captures the important information from the nodes, with allowing for better performance on the classification task.
higher weights given to nodes that are more relevant to the task at Let C be the set of classes, and let X be the input text. Let f C be a
hand. The attention mechanism allows the model to focus on the function that maps X to a set of class probabilities, i.e.,
most relevant nodes in the graph, which can vary depending on
the specific text classification task. For example, in a sentiment f CðxÞ ¼ ½pðcjX Þfc2Cg ð13Þ
analysis task, the model may need to focus more on the words that
where pðcjX Þ is the probability of class c given input text X. The out-
convey sentiment, while in a topic classification task, the model
put of the Dynamic Fusion with BERT is obtained by dynamically
may need to focus more on the words that relate to the topic of
combining the outputs from the previous stages with the outputs
the document.
Let G = (V, E) be a hierarchical graph representing a document, where V is the set of nodes and E is the set of edges. Each node i in V represents a word in the document and is embedded with a contextualized representation h_i, which encodes its local context information, linguistic features, and domain-specific knowledge. We first compute a query vector q that captures the task-specific information:

q = f_task(T)    (9)

where f_task maps the task-specific information T into the query space; in a sentiment analysis task, for instance, T may encode the probability of the input document belonging to the positive class. Next, we compute the attention weights for each node i in V based on its embedding h_i and the query vector q:

e_i = g(h_i, q),   a_i = softmax(e_i)    (11)

where g is a function that computes the attention score between the node embedding h_i and the query vector q, and softmax is the function that normalizes the attention scores to obtain a probability distribution over the nodes. We then compute a weighted sum of the node embeddings to obtain a new representation of the text:

h_G = sum_i (a_i · h_i)    (12)

where h_G is the integrated vector representation encoding the text. In summary, the Attention-based Graph Learning stage uses an attention mechanism to compute attention weights for each node in the graph based on its embedding and a task-specific query vector. The attention weights are used to compute a weighted sum of the node embeddings, yielding an integrated vector representation that captures the important information from the nodes and allows the model to focus on the most relevant nodes in the graph.
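A minimal sketch of Eqs. (9), (11), and (12) is given below. The module and parameter names (AttentionPooling, hidden_dim), the learned query vector standing in for f_task(T), and the bilinear choice for the scoring function g are illustrative assumptions rather than details of the framework's implementation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention-based read-out over contextual node embeddings (Eqs. (9), (11), (12))."""

    def __init__(self, hidden_dim):
        super().__init__()
        # q = f_task(T): here the task-specific query is simply a learned vector
        self.query = nn.Parameter(torch.randn(hidden_dim))
        # g(h_i, q): a bilinear scoring function is one possible choice
        self.score = nn.Bilinear(hidden_dim, hidden_dim, 1)

    def forward(self, node_embeddings):                    # (num_nodes, hidden_dim)
        q = self.query.expand_as(node_embeddings)          # broadcast q to every node
        e = self.score(node_embeddings, q).squeeze(-1)     # e_i = g(h_i, q)
        a = torch.softmax(e, dim=0)                        # a_i = softmax(e_i)
        h_g = (a.unsqueeze(-1) * node_embeddings).sum(0)   # h_G = sum_i a_i * h_i
        return h_g, a

h_nodes = torch.randn(30, 256)                 # node embeddings from the earlier stages
h_g, attn = AttentionPooling(256)(h_nodes)     # integrated document vector and weights
```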
3.7. Dynamic fusion with pre-trained language models

In this stage, we apply the previous stages of the framework (i.e., Linguistic Feature Extraction, Hierarchical Node Construction with Domain-specific Knowledge, Contextual Node Embedding, Multi-level Graph Learning, Dynamic Text Sequential Feature Interaction, and Attention-based Graph Learning) to obtain the integrated vector representation of the text. Next, we feed the integrated vector representation into BERT, which generates a contextualized representation of the text. Finally, we concatenate the contextualized representation from BERT with the output from the previous stages, and pass the concatenated vector through a fully-connected layer to obtain the final class label. This approach leverages the strengths of both the proposed framework and BERT, allowing for better performance on the classification task.

Let C be the set of classes, and let X be the input text. Let f_C be a function that maps X to a set of class probabilities, i.e.,

f_C(X) = [p(c|X)]_{c∈C}    (13)

where p(c|X) is the probability of class c given input text X. The output of the Dynamic Fusion with BERT stage is obtained by dynamically combining the outputs from the previous stages with the outputs from a pre-trained BERT model. Let h_BERT be the output of the pre-trained BERT model for input X, and let h_i be the output of the i-th previous stage. The output of the Dynamic Fusion with BERT stage is defined as:

g(h_BERT, h_1, h_2, ..., h_n) = f_C(concatenate(h_BERT, h_1, h_2, ..., h_n))    (14)

where concatenate(h_BERT, h_1, h_2, ..., h_n) denotes the concatenation of the output vectors h_BERT, h_1, h_2, ..., h_n from each previous stage, and f_C is a function that maps the concatenated output vector to a set of class probabilities.

The parameters of the model are optimized by minimizing a loss function that measures the difference between the predicted class probabilities and the true class labels. The loss function can be defined as:

L = -log(p(y|X))    (15)

where y is the true class label, and p(y|X) is the predicted probability of the true class label given input text X.
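The dynamic fusion stage and the loss in Eqs. (13)–(15) can be sketched as follows. The sketch is illustrative only: the Hugging Face checkpoint name, the use of the [CLS] vector as h_BERT, and the stage dimensions are assumptions, not details taken from the proposed implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DynamicFusionClassifier(nn.Module):
    def __init__(self, stage_dims, num_classes, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        fused_dim = self.bert.config.hidden_size + sum(stage_dims)
        self.classifier = nn.Linear(fused_dim, num_classes)   # fully-connected layer

    def forward(self, input_ids, attention_mask, stage_outputs):
        h_bert = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]  # [CLS]
        fused = torch.cat([h_bert] + stage_outputs, dim=-1)   # concatenate(h_BERT, h_1..h_n)
        return self.classifier(fused)                         # logits; softmax gives f_C(X)

# training sketch: cross-entropy on the logits implements L = -log p(y|X)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DynamicFusionClassifier(stage_dims=[256, 128], num_classes=2)
batch = tokenizer(["an example document"], return_tensors="pt", padding=True)
stage_outputs = [torch.randn(1, 256), torch.randn(1, 128)]    # placeholders for h_1, h_2
logits = model(batch["input_ids"], batch["attention_mask"], stage_outputs)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))
```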

4. Experimental results and discussion

In this section, we present the experimental results and discussion of our proposed Hierarchical Graph-based Text Classification Framework with Contextual Node Embedding and BERT-based Dynamic Fusion, as well as the baseline models. The experiments were conducted on several benchmark datasets to evaluate the effectiveness and generalizability of our proposed framework. We report the classification accuracy, precision, recall, and F1-score to measure the performance of each model. Furthermore, we conduct an ablation study to investigate the contribution of each module in our proposed framework.

4.1. Datasets

The experimental evaluation in this study utilizes several benchmark datasets that are commonly used to assess the performance of state-of-the-art models in text classification. The rest of this section presents a brief explanation of each dataset and its descriptive statistics:
• 20NG: The dataset called "20 Newsgroups" is composed of around 20,000 documents related to different newsgroups, distributed nearly equally across 20 categories. On average, each document is 96.5 words long, and the vocabulary size in the dataset is 42,757 words (Wang et al., 2023).
• Airline Twitter dataset: This dataset contains approximately 14,000 tweets related to major US airlines. Each tweet is labeled as positive, negative, or neutral. The average tweet length is 14.4 words, and the vocabulary size is 11,168 words (Wan and Gao, 2015; Onan, 2022).
• App dataset: The App dataset consists of approximately 752,937 reviews of mobile apps from the Apple App Store. Each review is labeled as positive or negative. The average review length is 27.8 words, and the vocabulary size is 20,238 words (He and McAuley, 2016).
• MR: The MR dataset consists of movie reviews labeled as positive or negative. It contains 10,662 reviews, with an average length of 19.8 words and a vocabulary size of 18,764 words (Pang and Lee, 2005).
• Ohsumed: The Ohsumed dataset is a collection of abstracts from medical research papers. It contains 7,400 documents, partitioned into 23 different medical categories. The average document length is 135.82 words, and the vocabulary size is 14,157 words (Hersh et al., 1994).
• R52: The R52 dataset is a collection of news articles from the Reuters newswire service. It contains 9,100 documents, partitioned into 52 different topics. The average document length is 245.6 words, and the vocabulary size is 49,230 words (Wang et al., 2023).
• R8: The R8 dataset is a subset of the R52 dataset, containing 7,674 documents partitioned into 8 different topics. The average document length is 244.9 words, and the vocabulary size is 11,692 words (Wang et al., 2023).
• Sarcasm dataset: This dataset consists of approximately 40,000 tweets labeled as sarcastic or non-sarcastic. The average tweet length is 17.5 words, and the vocabulary size is 12,090 words (Onan, 2019).

In Table 1, a summary table for the datasets used in the experimental analysis is presented.

Table 1
The descriptive information for the benchmark datasets.

Dataset Name | Number of Documents | Number of Categories | Average Document Length | Vocabulary Size
20NG | 20,000 | 20 | 96.5 words | 42,757 words
Airline Twitter dataset | 14,000 | 3 (Positive, Negative, Neutral) | 14.4 words | 11,168 words
App dataset | 752,937 | 2 (Positive, Negative) | 27.8 words | 20,238 words
MR | 10,662 | 2 (Positive, Negative) | 19.8 words | 18,764 words
Ohsumed | 7,400 | 23 | 135.82 words | 14,157 words
R52 | 9,100 | 52 | 245.6 words | 49,230 words
R8 | 7,674 | 8 | 244.9 words | 11,692 words
Sarcasm dataset | 40,000 | 2 (Sarcastic, Non-sarcastic) | 17.5 words | 12,090 words

4.2. Baselines

This section briefly explains the state-of-the-art methods utilized in the empirical analysis:

• BERT: BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google that utilizes a transformer-based architecture to generate contextualized word embeddings. BERT has achieved state-of-the-art performance in various natural language processing tasks, including text classification, question answering, and named entity recognition (Devlin et al., 2018).
• BERT-GAT: It is a graph-based approach for text classification that combines the power of pre-trained language models like BERT and the Graph Attention Network (GAT) to capture complex relationships between words and concepts in a text. The approach constructs a graph from the text, with words as nodes and their relationships as edges, and applies the GAT to learn the representations of each node. Then, the representations are fed into a BERT model to generate the final classification result (Lin et al., 2021).
• BiLSTM: It stands for Bidirectional Long Short-Term Memory. It is a type of Recurrent Neural Network (RNN) that processes sequential data in both forward and backward directions. BiLSTMs are commonly used in Natural Language Processing (NLP) tasks, such as text classification and sentiment analysis, to capture the sequential nature of language (Liu et al., 2016).
• CGA2TC: It is a graph-based approach for text classification that incorporates contrastive learning with an adaptive augmentation strategy to obtain more robust node representations. It constructs a text graph by exploring word co-occurrence and document-word relationships, and introduces an adaptive augmentation strategy to generate two contrastive views that highlight important edges while reducing noise (Yang et al., 2022).
• CNN-non-static: It utilizes a convolutional layer to extract local features from the input text and a max-pooling layer to obtain the most important features. The network is trained end-to-end using backpropagation with stochastic gradient descent (Kim, 2014).
• FastText: It is a word embedding-based text classification method that utilizes n-gram features and provides an efficient way of representing the semantics of words and phrases in text data (Joulin et al., 2016).
• HyperGAT: It is a graph-based approach for text classification that uses hypergraphs to model the relationships between words in a document. It employs dual attention mechanisms to capture the high-order interactions between words and improve the efficiency of feature extraction (Ding et al., 2020).
• SWEM: It is a text classification approach that utilizes pre-trained word embeddings to encode text. SWEM uses different pooling mechanisms, such as max-pooling and average-pooling, to obtain a fixed-length document representation (Shen et al., 2018).
• TensorGCN: It is a graph-based approach for text classification that uses a graph tensor to capture the relationships between words in a document. In this method, each word is treated as a node in the graph tensor, and the edges between nodes represent the co-occurrence of words in the same document (Yao et al., 2019).
• TextFCG: The architecture addresses the limitations of traditional transductive learning approaches for text classification by constructing a single graph for all words in each text and labeling the edges with various contextual relations. The text graph contains different information of the documents and enhances the connectivity of the graph by introducing more typed edges, which improves the learning effect of the graph neural network (Wang et al., 2023).
• TextFCG-BERT: The architecture combines the TextFCG method, which constructs a graph for text classification based on fused contextual information, with BERT (Wang et al., 2023).
• TextGCN: It uses a pre-trained word embedding model to initialize the word vectors in the graph and then applies GCN layers to capture the graph structure and aggregate information from neighboring nodes (Yao et al., 2019).
• TextING: It is a graph-based approach for inductive text classification. Unlike traditional approaches that use pre-defined or fixed graphs, TextING constructs graphs on the fly, specific to each document. It employs a sliding window approach to build individual graphs, where each word in a document is represented as a node, and relationships between words are used to extract features (Zhang et al., 2020).
• TextING-M: It is a transductive variant of the TextING model for inductive text classification. Unlike the original TextING, which constructs a single large graph from the entire corpus and uses it to generate word embeddings, TextING-M constructs a separate graph for each document and extracts edges from the large graph based on the whole corpus (Zhang et al., 2020).
• Text-level GNN: It is a graph-based approach for text classification that builds document-level graphs with global parameter sharing to learn text representations. The model employs graph neural networks (GNNs) to capture the graph structure and aggregate information from neighboring nodes (Huang et al., 2019).
• TextSSL: It is a graph-based sparse structure learning model for inductive document classification that addresses challenges such as word ambiguity, word synonymity, and dynamic contextual dependency (Piao et al., 2022).
• T-VGAE (Topic-Enhanced Variational Graph Auto-Encoder): It is a graph-based text classification model that leverages topic modeling and variational autoencoders (VAEs) to capture the underlying structure of text documents (Xie et al., 2021).

4.3. Model variations

To evaluate the proposed framework's effectiveness, the study examined nine different model variants. A brief overview of the model variants utilized in the empirical analysis is presented below:

• Model Variant-1: It is a variation of the proposed framework that only uses Linguistic Feature Extraction (LFE) based on Part-of-Speech (POS) tagging. This means that the model only considers the grammatical structure of the text and extracts features such as noun, verb, and adjective tags.
• Model Variant-2: It employs Linguistic Feature Extraction (LFE) based on dependency parsing only. This means that instead of using Part-of-Speech (POS) tagging to extract linguistic features, the model uses dependency parsing to identify the grammatical relationships between words in a sentence.
• Model Variant-3: This variant focuses on extracting entities such as people, organizations, and locations from the input text as features to enhance the classification performance. By using only NER features, this variant aims to investigate the effectiveness of named entity recognition in capturing the discriminative information for text classification tasks.
• Model Variant-4: This variant utilizes LFE based on coreference resolution only, where the model extracts coreferent mentions in the text and assigns them to their respective entities. This variant does not consider any other linguistic features such as part-of-speech tagging or dependency parsing.
• Model Variant-5: It utilizes the hierarchical node construction stage without incorporating domain-specific knowledge.
• Model Variant-6: This variant removes the contextual node embedding stage from the proposed framework. Without this stage, the model is unable to capture the local context information and linguistic features of each node in the graph.
• Model Variant-7: This variant removes the dynamic text sequential feature interaction stage from the proposed framework. Without this stage, the model is unable to capture the sequential information of the text and generate dynamic features for each node.
• Model Variant-8: This variant removes the attention-based graph learning stage from the proposed framework. Without this stage, the model is unable to capture the important features of the nodes in the graph and may not be able to fully utilize the hierarchical structure of the graph.
• Model Variant-9: This variant removes the dynamic fusion with BERT stage from the proposed framework.

4.4. Experimental settings

For the empirical analysis, we have adopted the empirical framework outlined in Wang et al. (2023). We used the provided training and testing sets for each dataset, and randomly selected 10% of the training set as the validation set. By default, we used two layers of the GNN module and tuned the hyperparameters to optimize performance on the validation set. We used the Adam optimizer with a learning rate of 0.001 for Ohsumed, 8e-4 for 20NG, and 3e-4 for the other datasets. Dropout was set at 0.5 for each module, and we randomly dropped edges with a probability of 0.3 for the best performance. For the word embeddings, we initialized them using pre-trained GloVe vectors with dimension 300. We froze the word embeddings during model training for inductive text classification tasks. Out-of-vocabulary words were replaced with UNK and randomly sampled from a uniform distribution over [-0.01, 0.01]. We used the default parameter settings for the baseline models, as described in their original papers or implementations. The L2 loss weight was set to 5e-5 for R8 and 20NG and 5e-6 for the other datasets. In Table 2, the parameter list for the proposed scheme has been outlined.
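For reference, the settings described in Section 4.4 and listed in Table 2 could be collected into a single configuration object along the following lines; the field names are illustrative assumptions and do not come from the framework's code.

```python
# Illustrative configuration mirroring Section 4.4 and Table 2 (field names assumed).
config = {
    "max_sequence_length": 128,
    "document_window_size": 512,
    "sentence_window_size": 64,
    "pretrained_language_model": "BERT",
    "batch_size": 32,
    "bert_learning_rate": 2e-5,
    "graph_levels": 3,
    "neighbor_threshold_distance": 2,
    "gnn_hidden_size": 256,
    "gnn_layers": 2,
    "attention_hidden_size": 128,
    "reference_sequence_length": 10,
    "dtw_window_size": 5,
    "dropout_rate": 0.5,
    "edge_dropout_probability": 0.3,
    "word_embeddings": {"type": "GloVe", "dim": 300, "frozen": True},
    "optimizer": "Adam",
    "learning_rate": {"Ohsumed": 1e-3, "20NG": 8e-4, "default": 3e-4},
    "l2_weight": {"R8": 5e-5, "20NG": 5e-5, "default": 5e-6},
    "fc_hidden_layers": 1,
    "fc_hidden_size": 256,
    "fc_activation": "ReLU",
}
```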

4.5. Experimental results

This section covers the experimental results and discussion of our proposed text classification framework, which is based on hierarchical graphs, contextual node embedding, and dynamic fusion with BERT. We also present the results of the baseline models and evaluate the effectiveness and generalizability of our framework on various benchmark datasets. To measure the performance of each model, we report accuracy, precision, recall, and F1-score. Additionally, we conduct an ablation study to analyze the contribution of each module in our framework.

In Table 3, the classification accuracy values obtained by the baseline models and the proposed framework are presented. The experimental results indicate that the proposed model achieves the highest accuracy on all datasets. The second highest predictive performance among the compared models is achieved by the Text-FCG + BERT algorithm, which is a graph and transformer-based model, while Bi-LSTM and BERT + GAT follow as the next best-performing models. The empirical results indicate that the utilization of graph-based approaches to text classification is effective for capturing the complex relationships between words and producing accurate predictions. On the lower end of the spectrum, we see models such as Text-FCG, fastText, and SWEM, which achieve relatively low accuracy scores compared to the other models. These models typically rely on simpler approaches to text classification, such as word embedding-based representations. In Table 3, the empirical results for the model variants are also presented. The results show that the proposed framework outperforms all model variants and baselines on all datasets, indicating the importance of each module in the framework. In particular, variants that remove the dynamic fusion with BERT, attention-based graph learning, or dynamic text sequential feature interaction perform worse than the proposed framework, highlighting the effectiveness of these modules in improving classification performance. Additionally, model variants that rely on only one type of linguistic feature extraction perform worse than the proposed framework, indicating that combining multiple types of linguistic features is beneficial for text classification. Among the baseline models, BERT + GAT, Bi-LSTM, and HyperGAT perform relatively well, while fastText and SWEM perform the worst. It is worth noting that BERT, which has been widely used in many NLP tasks, achieves high performance on most datasets. This suggests that pre-trained language models like BERT can provide strong feature representations for text classification tasks. In terms of model variants, we can see that LFE based on POS tagging only (Model Variant-1) contributes significantly to the overall performance of the proposed model, indicating that part-of-speech tags can be a useful feature in text classification. On the other hand, removing key components of the proposed framework, such as domain-specific knowledge, contextual node embedding, dynamic text sequential feature interaction, and attention-based graph learning, all results in a significant decrease in performance. This suggests that each component of the proposed framework plays an important role in improving the classification performance. Overall, the experimental results show that the proposed framework is a promising approach for text classification tasks, and the ablation study sheds light on the contributions of the different components to the overall performance.

In Fig. 3, the histogram of accuracy values for the compared models has been presented. Based on the results, we can see that the proposed model achieves the highest performance in all benchmark datasets with a significant margin compared to the other models. It achieves an accuracy of 93.71% for 20NG, 94.18% for Airline Twitter, 96.97% for App, 89.30% for MR, 74.87% for Ohsumed, 98.75% for R8, 94.11% for R52, and 98.32% for Sarcasm. Among the baseline models, BERT + GAT, Bi-LSTM, and BERT also achieved relatively high performance, with BERT + GAT and Bi-LSTM outperforming BERT in most of the datasets. The proposed model, which uses hierarchical graph-based text classification with contextual node embedding and BERT-based dynamic fusion, also outperforms the variants of the proposed model that exclude certain components, such as LFE based on POS tagging only, LFE based on dependency parsing only, LFE based on NER only, LFE based on coreference resolution only, and HNC without domain-specific knowledge. It is also interesting to note that some models that use graph-based approaches, such as HyperGAT, TensorGCN, TextGCN, and Text-level GNN, perform worse than some models that do not use such approaches, such as Text-FCG, SWEM, and fastText. This suggests that while graph-based approaches can be effective, they may not always be the best choice for text classification tasks.

Based on the design approach of the baseline models and the proposed framework, we have classified all the compared schemes into six approaches: graph and transformer-based approaches, graph-based approaches, inductive learning-based approaches, sequence-based approaches, transformer-based approaches, and word embedding-based approaches. The first approach, graph and transformer-based, includes models that combine both graph neural networks and transformer-based architectures to process text data. The second approach, graph-based, includes models that use only graph neural networks to process text data. The third approach, inductive learning-based, includes models that use inductive learning techniques to generalize to new, unseen examples of text data. The fourth approach, sequence-based, includes models that use recurrent neural networks or other sequence-based models to process text data.

Table 2
The parameter settings for the proposed framework.

Stage | Parameter | Default Value
Linguistic Feature Extraction | Maximum Sequence Length | 128
Hierarchical Node Construction with Domain-specific Knowledge | Window Size for Document Splitting | 512
Hierarchical Node Construction with Domain-specific Knowledge | Window Size for Sentence Splitting | 64
Contextual Node Embedding | Pre-trained Language Model | BERT
Contextual Node Embedding | Batch Size | 32
Contextual Node Embedding | Learning Rate | 2e-5
Multi-level Graph Learning | Number of Levels | 3
Multi-level Graph Learning | Threshold Distance for Neighbor Nodes | 2
Multi-level Graph Learning | GNN Hidden Size | 256
Multi-level Graph Learning | Attention Hidden Size | 128
Dynamic Text Sequential Feature Interaction | Reference Sequence Length | 10
Dynamic Text Sequential Feature Interaction | DTW Window Size | 5
Attention-based Graph Learning | Attention Hidden Size | 128
Attention-based Graph Learning | Dropout Rate | 0.5
Fully-connected Layer | Number of Hidden Layers | 1
Fully-connected Layer | Hidden Layer Size | 256
Fully-connected Layer | Activation Function | ReLU
Fully-connected Layer | Output Size | Number of Classes

Table 3
The classification accuracy values obtained by the models.

Model 20 NG Airline Twitter App MR Ohsumed R8 R52 Sarcasm


BERT 82,3543 84,6916 86,8194 83,7008 70,5941 91,0623 87,3277 89,2193
BERT + GAT 83,4947 85,9176 88,7381 85,3084 71,0906 92,1175 89,0413 91,9237
Bi-LSTM 83,6220 86,9721 88,7610 85,4859 71,3481 92,1876 90,8577 92,2417
CGA2TC 79,9378 80,6012 83,7485 80,9234 68,1620 86,2379 84,6360 84,8943
CNN-non-static 79,6785 80,2287 83,1909 80,2140 67,9369 86,2335 84,5957 84,5009
fastText 78,6231 79,3416 82,4813 79,2438 66,5359 84,5157 83,4018 84,0179
HyperGAT 80,3591 82,8367 85,0237 81,9715 68,8696 88,4086 85,4351 87,3538
Model variant-1 89,3129 93,4062 96,4852 88,9155 74,4210 95,8838 92,9030 98,0199
Model variant-2 84,0038 88,0422 89,1815 85,7718 71,9050 92,2727 91,0073 94,6713
Model variant-3 84,8058 88,2002 89,9708 86,4900 71,9596 92,4000 91,1597 95,6183
Model variant-4 85,6385 92,0047 92,6563 86,7714 73,4522 95,3171 91,2966 96,3640
Model variant-5 81,0457 83,1874 85,0640 82,1760 69,7983 88,9049 85,6177 87,4934
Model variant-6 81,4721 84,2529 85,2457 82,6272 70,0614 90,2421 85,7046 88,3450
Model variant-7 82,4477 84,8885 87,5875 84,1844 70,9467 91,1592 88,0228 89,8291
Model variant-8 82,9909 85,7039 88,1029 84,4073 71,0176 91,8470 88,6887 89,8308
Model variant-9 80,1005 81,4217 84,5173 81,7852 68,2955 87,2610 85,1003 86,1224
SWEM 78,7725 79,9446 82,7520 79,6498 66,8541 85,1530 83,6203 84,3090
TensorGCN 80,0721 81,1619 84,4143 81,2751 68,1784 87,2190 84,7137 85,5884
Text-FCG 78,4636 79,2525 82,3393 78,6838 66,5095 84,3647 82,9146 84,0070
Text-FCG + BERT 85,6728 93,0309 95,7353 87,1919 74,0812 95,4332 91,6306 96,7459
TextGCN 79,5181 80,2078 82,9734 80,1814 67,1227 86,0711 84,0130 84,4858
TextING 81,6139 84,3717 86,4480 83,2304 70,3486 90,5319 86,5721 89,0219
TextING-M 81,6196 84,4533 86,7509 83,5223 70,4449 90,8804 86,9398 89,1190
Text-level GNN 78,7489 79,8716 82,5709 79,4437 66,7660 84,8542 83,4812 84,1922
TextSSL 81,5012 84,3559 86,1912 82,9404 70,1594 90,3408 86,5400 88,7668
T-VGAE 80,2750 82,1226 84,7842 81,8806 68,4622 88,0796 85,1459 86,3845
Proposed model 93,7120 94,1785 96,9717 89,2973 74,8684 98,7528 94,1073 98,3166

Fig. 3. The histogram of accuracy values for the models.

The fifth approach, transformer-based, includes models that use transformer-based architectures, such as the popular BERT model, to process text data. The sixth and final approach, word embedding-based, includes models that use word embedding techniques, such as fastText or GloVe, to process text data. By categorizing the models into these approaches, the study provides a framework for understanding the different design choices and methodologies used in the field of text classification, and can help guide future research in this area. In Fig. 4, the interval plot of accuracy values for the different approaches has been presented. As can be observed from the empirical results presented in Fig. 4, utilizing both graph neural networks and transformer-based architectures to process text data can yield promising results.

In Fig. 5, the interaction plot of accuracy values for the different datasets has been presented. Dataset characteristics play a significant role in determining the performance of models. For instance, the proposed model achieved the highest performance in all datasets, which could be attributed to the complexity of the datasets and the need for sophisticated models to handle them.


Among the compared models, the graph and transformer-based approaches and the purely transformer-based approaches achieved the highest performance in most datasets. This could be due to the ability of these models to capture both local and global dependencies in the data. Models that combine different approaches, such as Text-FCG + BERT, achieved high performance in most datasets. This suggests that combining different methods can be an effective strategy for improving performance. Word embedding-based approaches, such as fastText and SWEM, achieved lower performance compared to the other models in most datasets. This could be due to their inability to capture the sequential and structural information in the data. The performance of the models varied significantly across datasets, indicating the importance of dataset selection in the development and evaluation of models. The proposed model achieved exceptionally high performance on the Sarcasm dataset, which suggests that the proposed framework could be effective in handling datasets with a high level of complexity and subtlety.

Fig. 4. The interval plot of accuracy values for different approaches.

Inductive learning is a type of machine learning where the goal is to generalize from a limited set of examples to a broader set of unseen examples. In the context of text classification, inductive learning is particularly challenging because the models need to be able to adapt to new types of text and language styles that they may not have encountered before. In this section, we explore how different models perform in inductive learning scenarios and evaluate their adaptability to new types of text. For the inductive learning experiments, we considered different portions of the training set to evaluate the adaptability of the models. Specifically, we analyzed the performance of the models in the presence of unseen words and graph structures during training. By gradually increasing the percentage of training data, we observed that all the evaluated methods were able to enhance their performance, as illustrated in Fig. 6. In Fig. 7, the models are compared in terms of their performance in inductive text classification. Our proposed scheme outperformed the other baselines significantly, demonstrating its effectiveness in inductive text classification.

In addition to the classification accuracy values, we also present the precision, recall, and F1-score values for the compared models in Table 4, Table 5, and Table 6, respectively. Precision is a metric that measures the proportion of correctly classified instances out of all instances that the model predicted as positive for a particular class. In Table 4, precision values are reported for the various models on the different datasets. The proposed model consistently outperforms the other models across all datasets, achieving precision values above 90% on most datasets and even reaching 99.75% on the R8 dataset. This indicates that the proposed model is effective in accurately classifying text data, while also being adaptable to different types of datasets. Additionally, it is observed that deep learning-based models such as BERT, Bi-LSTM, and their variants perform relatively better than traditional machine learning models like fastText and SWEM. The same patterns also hold for the results presented in Table 5 and Table 6 for the recall and the F1-score.

To further examine the statistical validity of the results, we performed a two-way ANOVA test in the Minitab statistical software. In Table 7, the two-way ANOVA test results for the accuracy values are presented. The ANOVA (analysis of variance) test is a statistical technique used to determine whether the mean of a dependent variable differs significantly across different groups or levels of one or more independent variables. In this case, the ANOVA test is used to analyze the effects of three independent variables (model, induction, and dataset) on the dependent variable (accuracy).
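The accuracy, precision, recall, and F1-score values reported in Tables 3–6 can be computed as in the following sketch; macro-averaging over classes and the use of scikit-learn are assumptions made purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 1, 0, 1]          # gold labels (toy example)
y_pred = [0, 1, 0, 0, 1]          # model predictions (toy example)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(f"acc={accuracy:.4f} P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```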

Fig. 5. The interaction plot of accuracy values for different datasets.


Fig. 6. The main effects plot of accuracy values for training percentages.

Table 7 presents the results of the two-way ANOVA test, which considers the interaction between two independent variables. The table provides the degrees of freedom (DF), adjusted sum of squares (Adj SS), adjusted mean squares (Adj MS), F-value, and p-value for each source of variation. The "Model" row in the ANOVA table indicates the effect of the different models on the accuracy scores. The F-value of 448.58 and the very small p-value (0.000) indicate that the mean accuracy score differs significantly across the different models. The "Induction" row shows the effect of the different training percentages on the accuracy scores. The F-value of 35896.21 and the very small p-value (0.000) indicate that the mean accuracy score differs significantly across the different training percentages. The "Dataset" row indicates the effect of the different datasets on the accuracy scores. The F-value of 8997.32 and the very small p-value (0.000) indicate that the mean accuracy score differs significantly across the different datasets. The "Model*Induction" row shows the interaction effect between models and induction. The F-value of 38.52 and the very small p-value (0.000) indicate that the interaction between models and induction significantly affects the accuracy scores. The "Model*Dataset" row shows the interaction effect between models and datasets. The F-value of 2.06 and the very small p-value (0.000) indicate that the interaction between models and datasets also significantly affects the accuracy scores, although to a lesser extent than the other sources of variation. Finally, the "Induction*Dataset" row shows the interaction effect between induction and datasets. The F-value of 115.11 and the very small p-value (0.000) indicate that the interaction between induction and dataset significantly affects the accuracy scores.
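The study runs the two-way ANOVA in Minitab. Purely as an illustration, an equivalent analysis could be set up in Python with statsmodels on a long-format table of accuracy scores; the file name and column names below are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# long-format table with one accuracy value per (model, induction, dataset) cell
results = pd.read_csv("accuracy_scores.csv")   # columns: model, induction, dataset, accuracy

fit = smf.ols(
    "accuracy ~ C(model) + C(induction) + C(dataset)"
    " + C(model):C(induction) + C(model):C(dataset) + C(induction):C(dataset)",
    data=results,
).fit()
anova_table = sm.stats.anova_lm(fit, typ=2)    # Adj SS, F-values and p-values as in Table 7
print(anova_table)
```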
Fig. 7. The interaction plot of accuracy values for training percentages.

Fig. 8 presents an interval plot of accuracy values for the compared methods. The plot shows the range of accuracy values for each model, along with a dashed line indicating the threshold for statistical significance. As shown in the plot, the proposed scheme achieves higher predictive performance values, which are located above the right dashed line, indicating that the results are statistically significant. This means that the proposed scheme significantly outperforms the other compared methods in terms of accuracy.

Table 4
The precision values obtained by the models.

Model 20 NG Airline Twitter App MR Ohsumed R8 R52 Sarcasm


BERT 0,8319 0,8555 0,8770 0,8455 0,7131 0,9198 0,8821 0,9012
BERT + GAT 0,8434 0,8679 0,8963 0,8617 0,7181 0,9305 0,8994 0,9285
Bi-LSTM 0,8447 0,8785 0,8966 0,8635 0,7207 0,9312 0,9178 0,9317
CGA2TC 0,8075 0,8142 0,8459 0,8174 0,6885 0,8711 0,8549 0,8575
CNN-non-static 0,8048 0,8104 0,8403 0,8102 0,6862 0,8710 0,8545 0,8535
fastText 0,7942 0,8014 0,8331 0,8004 0,6721 0,8537 0,8424 0,8487
HyperGAT 0,8117 0,8367 0,8588 0,8280 0,6957 0,8930 0,8630 0,8824
Model variant-1 0,9022 0,9435 0,9746 0,8981 0,7517 0,9685 0,9384 0,9901
Model variant-2 0,8485 0,8893 0,9008 0,8664 0,7263 0,9320 0,9193 0,9563
Model variant-3 0,8566 0,8909 0,9088 0,8736 0,7269 0,9333 0,9208 0,9658
Model variant-4 0,8650 0,9293 0,9359 0,8765 0,7419 0,9628 0,9222 0,9734
Model variant-5 0,8186 0,8403 0,8592 0,8301 0,7050 0,8980 0,8648 0,8838
Model variant-6 0,8230 0,8510 0,8611 0,8346 0,7077 0,9115 0,8657 0,8924
Model variant-7 0,8328 0,8575 0,8847 0,8503 0,7166 0,9208 0,8891 0,9074
Model variant-8 0,8383 0,8657 0,8899 0,8526 0,7173 0,9277 0,8958 0,9074
Model variant-9 0,8091 0,8224 0,8537 0,8261 0,6899 0,8814 0,8596 0,8699
SWEM 0,7957 0,8075 0,8359 0,8045 0,6753 0,8601 0,8446 0,8516
TensorGCN 0,8088 0,8198 0,8527 0,8210 0,6887 0,8810 0,8557 0,8645
Text-FCG 0,7926 0,8005 0,8317 0,7948 0,6718 0,8522 0,8375 0,8486
Text-FCG + BERT 0,8654 0,9397 0,9670 0,8807 0,7483 0,9640 0,9256 0,9772
TextGCN 0,8032 0,8102 0,8381 0,8099 0,6780 0,8694 0,8486 0,8534
TextING 0,8244 0,8522 0,8732 0,8407 0,7106 0,9145 0,8745 0,8992
TextING-M 0,8244 0,8531 0,8763 0,8437 0,7116 0,9180 0,8782 0,9002
Text-level GNN 0,7954 0,8068 0,8340 0,8025 0,6744 0,8571 0,8432 0,8504
TextSSL 0,8232 0,8521 0,8706 0,8378 0,7087 0,9125 0,8741 0,8966
T-VGAE 0,8109 0,8295 0,8564 0,8271 0,6915 0,8897 0,8601 0,8726
Proposed model 0,9466 0,9513 0,9795 0,9020 0,7562 0,9975 0,9506 0,9931


Table 5
The recall values obtained by the models.

Model 20 NG Airline Twitter App MR Ohsumed R8 R52 Sarcasm


BERT 0,8404 0,8642 0,8859 0,8541 0,7203 0,9292 0,8911 0,9104
BERT + GAT 0,8520 0,8767 0,9055 0,8705 0,7254 0,9400 0,9086 0,9380
Bi-LSTM 0,8533 0,8875 0,9057 0,8723 0,7280 0,9407 0,9271 0,9412
CGA2TC 0,8157 0,8225 0,8546 0,8257 0,6955 0,8800 0,8636 0,8663
CNN-non-static 0,8130 0,8187 0,8489 0,8185 0,6932 0,8799 0,8632 0,8623
fastText 0,8023 0,8096 0,8416 0,8086 0,6789 0,8624 0,8510 0,8573
HyperGAT 0,8200 0,8453 0,8676 0,8364 0,7028 0,9021 0,8718 0,8914
Model variant-1 0,9114 0,9531 0,9845 0,9073 0,7594 0,9784 0,9480 0,9878
Model variant-2 0,8572 0,8984 0,9100 0,8752 0,7337 0,9416 0,9286 0,9660
Model variant-3 0,8654 0,9000 0,9181 0,8826 0,7343 0,9429 0,9302 0,9757
Model variant-4 0,8739 0,9388 0,9455 0,8854 0,7495 0,9726 0,9316 0,9833
Model variant-5 0,8270 0,8489 0,8680 0,8385 0,7122 0,9072 0,8736 0,8928
Model variant-6 0,8313 0,8597 0,8699 0,8431 0,7149 0,9208 0,8745 0,9015
Model variant-7 0,8413 0,8662 0,8938 0,8590 0,7239 0,9302 0,8982 0,9166
Model variant-8 0,8468 0,8745 0,8990 0,8613 0,7247 0,9372 0,9050 0,9166
Model variant-9 0,8174 0,8308 0,8624 0,8345 0,6969 0,8904 0,8684 0,8788
SWEM 0,8038 0,8158 0,8444 0,8128 0,6822 0,8689 0,8533 0,8603
TensorGCN 0,8171 0,8282 0,8614 0,8293 0,6957 0,8900 0,8644 0,8734
Text-FCG 0,8006 0,8087 0,8402 0,8029 0,6787 0,8609 0,8461 0,8572
Text-FCG + BERT 0,8742 0,9493 0,9769 0,8897 0,7559 0,9738 0,9350 0,9872
TextGCN 0,8114 0,8184 0,8467 0,8182 0,6849 0,8783 0,8573 0,8621
TextING 0,8328 0,8609 0,8821 0,8493 0,7178 0,9238 0,8834 0,9084
TextING-M 0,8329 0,8618 0,8852 0,8523 0,7188 0,9274 0,8871 0,9094
Text-level GNN 0,8036 0,8150 0,8426 0,8107 0,6813 0,8659 0,8518 0,8591
TextSSL 0,8316 0,8608 0,8795 0,8463 0,7159 0,9218 0,8831 0,9058
T-VGAE 0,8191 0,8380 0,8651 0,8355 0,6986 0,8988 0,8688 0,8815
Proposed model 0,9562 0,9610 0,9895 0,9112 0,7640 0,9872 0,9603 0,9890

Table 6
The F1-score values obtained by the models.

Model 20 NG Airline Twitter App MR Ohsumed R8 R52 Sarcasm


BERT 0,8361 0,8598 0,8814 0,8498 0,7167 0,9245 0,8866 0,9058
BERT + GAT 0,8477 0,8723 0,9009 0,8661 0,7217 0,9352 0,9040 0,9332
Bi-LSTM 0,8490 0,8830 0,9011 0,8679 0,7243 0,9359 0,9224 0,9365
CGA2TC 0,8116 0,8183 0,8502 0,8216 0,6920 0,8755 0,8592 0,8619
CNN-non-static 0,8089 0,8145 0,8446 0,8144 0,6897 0,8755 0,8588 0,8579
fastText 0,7982 0,8055 0,8374 0,8045 0,6755 0,8580 0,8467 0,8530
HyperGAT 0,8158 0,8410 0,8632 0,8322 0,6992 0,8975 0,8674 0,8868
Model variant-1 0,9067 0,9483 0,9795 0,9027 0,7555 0,9734 0,9432 0,9889
Model variant-2 0,8528 0,8938 0,9054 0,8708 0,7300 0,9368 0,9239 0,9611
Model variant-3 0,8610 0,8954 0,9134 0,8781 0,7306 0,9381 0,9255 0,9707
Model variant-4 0,8694 0,9341 0,9407 0,8809 0,7457 0,9677 0,9269 0,9783
Model variant-5 0,8228 0,8445 0,8636 0,8343 0,7086 0,9026 0,8692 0,8883
Model variant-6 0,8271 0,8554 0,8654 0,8389 0,7113 0,9162 0,8701 0,8969
Model variant-7 0,8370 0,8618 0,8892 0,8547 0,7203 0,9255 0,8936 0,9120
Model variant-8 0,8425 0,8701 0,8944 0,8569 0,7210 0,9325 0,9004 0,9120
Model variant-9 0,8132 0,8266 0,8580 0,8303 0,6934 0,8859 0,8640 0,8743
SWEM 0,7997 0,8116 0,8401 0,8086 0,6787 0,8645 0,8489 0,8559
TensorGCN 0,8129 0,8240 0,8570 0,8251 0,6922 0,8855 0,8600 0,8689
Text-FCG 0,7966 0,8046 0,8359 0,7988 0,6752 0,8565 0,8418 0,8529
Text-FCG + BERT 0,8698 0,9445 0,9719 0,8852 0,7521 0,9689 0,9303 0,9822
TextGCN 0,8073 0,8143 0,8424 0,8140 0,6814 0,8738 0,8529 0,8577
TextING 0,8286 0,8566 0,8776 0,8450 0,7142 0,9191 0,8789 0,9038
TextING-M 0,8286 0,8574 0,8807 0,8479 0,7152 0,9226 0,8826 0,9048
Text-level GNN 0,7995 0,8109 0,8383 0,8065 0,6778 0,8615 0,8475 0,8547
TextSSL 0,8274 0,8564 0,8750 0,8420 0,7123 0,9172 0,8786 0,9012
T-VGAE 0,8150 0,8337 0,8608 0,8313 0,6950 0,8942 0,8644 0,8770
Proposed model 0,9514 0,9561 0,9845 0,9066 0,7601 0,9923 0,9554 0,9910

Table 7
The two-way ANOVA test results by the models.

Source DF Adj SS Adj MS F-Value P-Value


Model 26 4607,8 177,2 448,58 0,000
Induction 4 56726,9 14181,7 35896,21 0,000
Dataset 7 24882,3 3554,6 8997,32 0,000
Model*Induction 104 1582,9 15,2 38,52 0,000
Model*Dataset 182 148,2 0,8 2,06 0,000
Induction*Dataset 28 1273,4 45,5 115,11 0,000
Error 728 287,6 0,4
Total 1079 89509,1


Fig. 8. The interval plot of accuracy values for compared models.

There are some more detailed managerial insights that can be drawn from the empirical results:

• The proposed text classification framework, which is based on hierarchical graphs, contextual node embedding, and dynamic fusion with BERT, achieves the highest accuracy on all benchmark datasets compared to the baseline models.
• Models that combine different approaches, such as graph-based and transformer-based architectures, achieve higher performance compared to models that rely on only one approach.
• Pre-trained language models like BERT can provide strong feature representations for text classification tasks and achieve high performance in most datasets.
• Combining multiple types of linguistic features, such as part-of-speech tags and named entity recognition, can improve the performance of text classification models.
• Domain-specific knowledge, contextual node embedding, dynamic text sequential feature interaction, and attention-based graph learning are important components of the proposed framework and significantly contribute to its overall performance.
• The proposed framework is effective in inductive text classification and outperforms the other baseline models in this scenario.
• Deep learning-based models, such as BERT and Bi-LSTM, perform relatively better than traditional machine learning models, such as fastText and SWEM.
• The proposed framework achieves high precision, recall, and F1-score values, indicating its effectiveness in accurately classifying text data.
• The empirical results indicate that dataset selection is important in the development and evaluation of models, and that the performance of models can vary significantly across datasets.
• Some graph-based models, such as HyperGAT, TensorGCN, TextGCN, and Text-level GNN, can perform worse than models that do not use such approaches, such as Text-FCG, SWEM, and fastText.
• The proposed framework achieves exceptionally high performance on the Sarcasm dataset, indicating its effectiveness in handling complex and subtle datasets.
• LFE based on POS tagging significantly contributes to the overall performance of the proposed model.

5. Conclusion

In conclusion, this study proposed a novel text classification framework based on hierarchical graphs, contextual node embedding, and dynamic fusion with BERT. The proposed framework achieved state-of-the-art performance on various benchmark datasets, outperforming other baseline models and approaches. Through an extensive experimental evaluation, we demonstrated the effectiveness and generalizability of our proposed framework for text classification tasks. We also conducted an ablation study to analyze the contribution of each module in our framework and found that each component plays an important role in improving classification performance. Our results showed that combining multiple types of linguistic features is beneficial for text classification, and that models relying on only one type of feature extraction perform worse than the proposed framework. Furthermore, pre-trained language models like BERT can provide strong feature representations for text classification tasks. We also categorized the compared models into six approaches, including graph and transformer-based, graph-based, inductive learning-based, sequence-based, transformer-based, and word embedding-based. By analyzing the empirical results, we found that utilizing both graph neural networks and transformer-based architectures to process text data can yield promising results. We also observed that dataset characteristics play a significant role in determining the performance of models, and that combining different approaches can be an effective strategy for improving performance. Finally, we performed a two-way ANOVA test to further examine the statistical validity of the results, which showed that the mean accuracy score differs significantly across different models, training percentages, and datasets. The interval plot of accuracy values for the compared methods also showed that the proposed scheme achieves higher predictive performance values, which are statistically significant. The proposed text classification framework has several potential applications in the field of natural language processing, including sentiment analysis, topic modeling, and opinion mining. The findings of this study can provide insights for researchers and practitioners in developing effective and accurate text classification models. Future work could explore the use of other types of linguistic features, such as syntax and semantics, and investigate the impact of different types of graphs on classification performance. Additionally, the proposed framework could be extended to other languages and domains beyond English and general texts.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Aggarwal, C.C., Zhai, C., 2012. A survey of text classification algorithms. Mining Text Data, 163–222.
Chen, Y., 2015. Convolutional Neural Network for Sentence Classification. Master's thesis, University of Waterloo.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R., 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Ding, K., Wang, J., Li, J., Li, D., Liu, H., 2020. Be more with less: Hypergraph attention networks for inductive text classification. arXiv preprint arXiv:2011.00387.
He, R., McAuley, J., 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In: Proceedings of the 25th International Conference on World Wide Web, pp. 507–517.
Hersh, W., Buckley, C., Leone, T.J., Hickam, D., 1994. OHSUMED: An interactive retrieval evaluation and new large test collection for research. In: SIGIR'94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, organised by Dublin City University. Springer, London, pp. 192–201.
Howard, J., Ruder, S., 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Hu, D., 2020. An introductory survey on attention mechanisms in NLP problems. In: Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys), Volume 2. Springer International Publishing, pp. 432–448.

Huang, L., Ma, D., Li, S., Zhang, X., Wang, H., 2019. Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356.
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T., 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Kim, Y., 2014. Convolutional neural networks for sentence classification. In: Moschitti, A., Pang, B., Daelemans, W. (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar. ACL, pp. 1746–1751.
Koncel-Kedziorski, R., Bekal, D., Luan, Y., Lapata, M., Hajishirzi, H., 2019. Text generation from knowledge graphs with graph transformers. arXiv preprint arXiv:1904.02342.
Korde, V., Mahender, C.N., 2012. Text classification and classifiers: A survey. Int. J. Artif. Intell. Appl. 3 (2), 85.
Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., et al., 2020. A survey on text classification: From shallow to deep learning. arXiv preprint arXiv:2008.00364.
Li, J., Tang, T., Zhao, W.X., Wen, J.R., 2021. Pretrained language models for text generation: A survey. arXiv preprint arXiv:2105.10311.
Lin, Y., Meng, Y., Sun, X., Han, Q., Kuang, K., Li, J., Wu, F., 2021. BertGCN: Transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727.
Liu, L., Finch, A., Utiyama, M., Sumita, E., 2016. Agreement on target-bidirectional LSTMs for sequence-to-sequence learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1.
Liu, X., You, X., Zhang, X., Wu, J., Lv, P., 2020. Tensor graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8409–8416.
Liu, B., Wu, L., 2022. Graph neural networks in natural language processing. Graph Neural Networks: Found. Front. Appl. 12, 463–481.
Malekzadeh, M., Hajibabaee, P., Heidari, M., Zad, S., Uzuner, O., Jones, J.H., 2021. Review of graph neural network in text classification. In: 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). IEEE, pp. 0084–0091.
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J., 2021. Deep learning-based text classification: A comprehensive review. ACM Comput. Surv. 54 (3), 1–40.
Niu, Z., Zhong, G., Yu, H., 2021. A review on the attention mechanism of deep learning. Neurocomputing 452, 48–62.
Onan, A., 2019. Topic-enriched word embeddings for sarcasm identification. In: Software Engineering Methods in Intelligent Algorithms: Proceedings of 8th Computer Science On-line Conference 2019, vol. 18. Springer International Publishing, pp. 293–304.
Onan, A., 2022. Bidirectional convolutional recurrent neural network architecture with group-wise enhancement mechanism for text sentiment classification. J. King Saud Univ.-Comput. Inf. Sci. 34 (5), 2098–2117.
Otter, D.W., Medina, J.R., Kalita, J.K., 2020. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Networks Learn. Syst. 32 (2), 604–624.
Pang, B., Lee, L., 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv preprint cs/0506075.
Piao, Y., Lee, S., Lee, D., Kim, S., 2022. Sparse structure learning via graph neural networks for inductive document classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, pp. 11165–11173.
Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., Huang, X., 2020. Pre-trained models for natural language processing: A survey. Sci. China Technol. Sci. 63 (10), 1872–1897.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI Blog 1 (8), 9.
Ragesh, R., Sellamanickam, S., Iyer, A., Bairi, R., Lingam, V., 2021. HeteGCN: Heterogeneous graph convolutional networks for text classification. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 860–868.
Rousseau, F., Kiagias, E., Vazirgiannis, M., 2015. Text categorization as a graph classification problem. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1702–1712.
Shen, D., Wang, G., Wang, W., Min, M.R., Su, Q., Zhang, Y., et al., 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. arXiv preprint arXiv:1805.09843.
Vashishth, S., Yadati, N., Talukdar, P., 2020. Graph-based deep learning in natural language processing. In: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, pp. 371–372.
Wan, Y., Gao, Q., 2015. An ensemble sentiment classification system of Twitter data for airline services analysis. In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, pp. 1318–1325.
Wang, Y., Wang, C., Zhan, J., Ma, W., Jiang, Y., 2023. Text FCG: Fusing contextual information via graph learning for text classification. Expert Syst. Appl., 119658.
Wu, L., Chen, Y., Ji, H., Liu, B., 2021. Deep learning on graphs for natural language processing. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2651–2653.
Wu, L., Chen, Y., Shen, K., Guo, X., Gao, H., Li, S., Long, B., 2023. Graph neural networks for natural language processing: A survey. Found. Trends Mach. Learn. 16 (2), 119–328.
Wu, H., Liu, Y., Wang, J., 2020. Review of text classification methods on deep learning. Comput. Mater. Continua 63 (3), 1309.
Xie, Q., Huang, J., Du, P., Peng, M., Nie, J.Y., 2021. Inductive topic variational graph auto-encoder for text classification. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4218–4227.
Yang, Y., Miao, R., Wang, Y., Wang, X., 2022. Contrastive graph convolutional networks with adaptive augmentation for text classification. Inf. Process. Manag. 59 (4), 102946.
Yao, L., Mao, C., Luo, Y., 2019. Graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 7370–7377.
Zhang, Y., Yu, X., Cui, Z., Wu, S., Wen, Z., Wang, L., 2020. Every document owns its structure: Inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826.
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Sun, M., 2020. Graph neural networks: A review of methods and applications. AI Open 1, 57–81.
