
Expert Systems With Applications 251 (2024) 124069


Edge-enhanced minimum-margin graph attention network for short text classification
Wei Ai a , Yingying Wei a , Hongen Shao a , Yuntao Shou a , Tao Meng a ,∗, Keqin Li b
a College of Computer and Mathematics, Central South University of Forestry and Technology, Hunan 410004, China
b Department of Computer Science, State University of New York, New Paltz, NY 12561, USA

ARTICLE INFO

Keywords:
Short text classification
Graph neural networks
Attention mechanism
Feature enhancement

ABSTRACT

With the rapid advancement of the internet, there has been a dramatic increase in short-text data. Due to the brevity of short texts, sparse features, and limited contextual information, short-text classification has become a challenging task in natural language processing. However, current methods primarily capture semantic information from locally-sequenced words in short text, which ignores the intricate feature relationships that pervade both the intra-text and inter-text. Therefore, this paper proposes a novel Edge-Enhanced Minimum-Margin Graph Attention Network (EMGAN) for short text classification to address this issue. Specifically, we construct a Heterogeneous Information Graph (HIG) to represent complex relationships among short text features. HIG mainly considers the relationship between document features and three attribute features, such as entities, topics, and keywords, and can represent short text features from multiple dimensions and levels. Then, to enhance the connectivity and expressiveness of the HIG for more effective propagation of feature information within it, we present a novel X-shaped structure edge-enhancement method. It enriches node relationships by reconstructing the edge structures. Furthermore, we design a Minimum Margin Graph Attention Network (MMGAN) for short text classification. Specifically, this method aims to explore the minimum margin between high-order neighbors and central nodes at the minimum cost, efficiently extracting and aggregating feature information. Extensive experimental results demonstrate that our proposed EMGAN model outperforms existing methods on five datasets, validating its effectiveness in short-text classification. Our code is submitted at https://github.com/w123yy/EMGAN.

1. Introduction

During the era of information proliferation, natural language processing (NLP) subtasks have undergone extensive scrutiny and found practical utility across many real-world predicaments (Hirschberg & Manning, 2015). Among these tasks, the challenge of text classification emerges as both a timeless quandary and an arduous undertaking (Chakraborty & Singh, 2022). As individuals increasingly acquire and disseminate information through diverse applications and websites, the succinct format of short texts, such as news tags, application reviews, instant messages, and tweets, has become an inseparable part of our daily lives. Its pervasive influence extends to various domains, including news categorization, social media (Kateb & Kalita, 2015), sentiment analysis (Balomenos et al., 2005), e-commerce, and spam filtering. Consequently, the role of short text classification proves indispensable in information retrieval. In light of its exceptionally high practical value, scholars diligently devote their efforts to exploring diverse methodologies (Yu, Ho, Arunachalam, Somaiya, & Lin, 2012).

Recently, deep neural networks have been proposed by researchers and widely utilized in the task of short text classification, such as convolutional neural networks (CNN) (Zhou, Li, Chi, Tang, & Zheng, 2022) and recurrent neural networks (RNN) (Graves & Graves, 2012; Zhou, Xu, Xu, Yang, & Li, 2016). Compared with traditional classification models, these models have achieved significant progress in short text classification (Pham, Nguyen, Pedrycz, & Vo, 2023). However, these models mainly focus on modeling sequential structural features, which significantly limits their ability to handle heterogeneous relationships among features. Graph neural networks (GNN) can overcome the limitations of sequence models by explicitly modeling and utilizing the inherent graph structure of the data, and show excellent performance in processing complex semantic and topological information. Therefore, transforming text into graph structures (Wang et al., 2022; Wu et al., 2020) has become an increasingly popular approach in text classification tasks.

∗ Corresponding author.
E-mail addresses: [email protected] (W. Ai), [email protected] (Y. Wei), [email protected] (H. Shao), [email protected]
(Y. Shou), [email protected] (T. Meng), [email protected] (K. Li).

https://doi.org/10.1016/j.eswa.2024.124069
Received 18 December 2023; Received in revised form 5 April 2024; Accepted 18 April 2024
Available online 23 April 2024
0957-4174/© 2024 Elsevier Ltd. All rights reserved.

Fig. 1. A comparison of sequence and graph structures for modeling short text representations. In the sequence structure, the relationship between features is relatively simple, generally related to the context where the feature is located. In the graph structure, the relationship between features is more complex; it is not limited to the context where the features are located and can represent the deep semantic relationship between features.

As shown in Fig. 1, in such studies, it is customary to construct a graph structure (Ragesh, Sellamanickam, Iyer, Bairi, & Lingam, 2021; Wang, Liu, Yang, Liu, & Wang, 2021) by treating text features (e.g., keywords, entities) and their corresponding relationships as nodes and edges. This method can handle unstructured data, capture correlations among different features, and effectively address issues such as sparse features and data imbalance by leveraging the graph structure. By applying this approach, researchers like Joachims (2005) have achieved better classification results by exploring latent themes, documents, and word-level graph operations in a corpus. This graph structure can efficiently represent interactions and associations within textual data, leading to improved semantic information capture and enhanced classification performance.

However, due to the concise nature of short text sentences and their sparse semantic features, as well as weak contextual associations, the task of short text classification becomes increasingly challenging. Firstly, short texts require incorporating additional information and utilizing external knowledge bases to enhance feature representation. For example, Chen, Yao, and Yang (2016) used a seed topic model to expand the information to solve the problem of sparse feature information. However, enriching the features solely through topic representation does not maximize the utilization of information, posing a key concern regarding how to effectively augment feature information and semantic associations. Secondly, existing short text classification methods based on graph convolutional neural networks (Pham et al., 2023) often focus on aggregating first-order neighbor information within each layer while overlooking the capture of long-distance higher-order semantics. Dealing with distant information propagation often requires multiple stacked layers. For example, Zhang, He, and Zhang (2022) used a multi-layer GCN to learn the features of the graph, leading to convergence issues and the potential loss of feature information. Hence, obtaining distant information is a worthy research challenge. Thirdly, short texts lack sufficient training data in practical scenarios, and manual annotation is time-consuming. In order to improve classification performance, many scholars adopt semi-supervised methods based on graph neural networks (GNN) to classify short texts with limited labeled data (Ai, Wang, Shao, Meng, & Li, 2023; Linmei, Yang, Shi, Ji, & Li, 2019). Among them, Wang, Wang, Yao, and Dou (2021) proposed a semi-supervised method for classifying brief texts using a heterogeneous graph neural network. This approach effectively utilizes limited labeled data and numerous unlabeled instances, propagating information through auto-generated graphs. However, it lacks interconnections between nodes of the same type, limiting its ability to capture document similarity and propagate labels. Thus, utilizing the limited labeled data remains a significant challenge.

To address the problems above, we propose a novel Edge-Enhanced Minimum-Margin Graph Attention Network (EMGAN) for short text classification. This method cleverly combines the edge enhancement technology and the minimum margin graph attention mechanism, which can optimize the overall topology and accurately capture high-order feature information, and is applied to heterogeneous information graphs for short text classification. Specifically, we construct a novel Heterogeneous Information Graph (HIG), which can well represent short text features and their complex relationships. HIG simultaneously considers entities, topics, and keywords as expanded features, addressing the inadequacy of short text features from multiple dimensions and perspectives. Then, we incorporate an edge-enhancement technique based on an 𝑋-shaped structure that reconstructs the edge structure between nodes, enriching relationships and forming a high-order HIG with dense and rich features. Furthermore, we also design the Minimum Margin Graph Attention Network (MMGAN) to address the feature aggregation issue in short-text classification. It utilizes edge-based higher-order attention, particularly focusing on exploring the minimal margin between high-order neighbors and center nodes at the lowest cost, facilitating feature extraction and aggregation, updating node features, reducing noise interference, and addressing the issue of sparse features in short texts. In short, EMGAN can effectively solve the sparse feature problem of short texts and significantly improve model performance and classification accuracy.

The main contributions of this article can be summarized as follows:

• We introduce a novel Heterogeneous Information Graph (HIG), which takes document features as central nodes and considers three related attribute features: entities, topics, and keywords, which expands features from multiple dimensions, effectively addressing the limitations of short text features.
• Then, we incorporate an edge-enhancement technique based on an 𝑋-shaped structure that forms an 𝑋-shaped high-order heterogeneous graph by reconstructing the edge connections between different central nodes. It enhances the connectivity of HIG, thereby improving the propagation and interaction of feature information between nodes.
• We design the Minimum Margin Graph Attention Network (MMGAN) for short text classification, which centers around the central node and comprehensively explores the structure of the HIG at the lowest cost. It effectively aggregates the content of distant neighbor nodes to supplement the central node with rich feature information, thus resolving the issue of feature sparsity.
• We perform comprehensive experiments on real-world datasets encompassing news articles, concise comments, and search snippets to assess the efficacy of our model in comparison to eleven baseline approaches. The experimental findings unequivocally establish that our model surpasses the current state-of-the-art baseline methods on the benchmark datasets.

Due to the pervasive nature of short text across various domains and the challenge of sparse feature information, we propose EMGAN, which introduces a novel approach integrating the Heterogeneous Information Graph (HIG), edge-enhancement technology, and the Minimum Margin Graph Attention Network. The aim is to offer a richer understanding of short text content, thereby significantly enhancing the accuracy and effectiveness of text classification. In summary, EMGAN represents innovation in this field and underscores the pressing need for advancements in short text classification techniques.

The remainder of the paper is structured as follows: Section 2 reviews previous work. Section 3 describes our proposed method and model, including building the HIG for short texts, edge-enhanced methods, and graph attention network models. In Section 4, we perform comprehensive experiments on the datasets and analyze the outcomes. Lastly, Section 5 concludes the paper, offering insights into future research directions.


Fig. 2. The overall framework of the EMGAN model consists of heterogeneous information graph construction, graph edge augmentation, and minimum margin graph attention
network.

2. Related work

This section presents an overview of pertinent literature concerning classifying concise textual content, encompassing both conventional approaches and deep neural network methodologies. Subsequently, we delve into contemporary research that explores the utilization of graph neural networks for short text classification, focusing on the current state of affairs.

2.1. Traditional short text classification

Text classification refers to extracting features from raw textual data and predicting categories for text data. Over the past several decades, researchers have introduced a multitude of models (Flisar & Podgorelec, 2020), including traditional machine learning algorithms such as NB (Lu, Chiang, Keh, & Huang, 2010; Xia, Wang, Chen, Duan, et al., 2018), Support Vector Machines (SVM) (Xia et al., 2020) and K-means (Joachims, 2005; Zhang, Yoshida, & Tang, 2008). However, traditional methods encounter a significant challenge of feature sparsity when dealing with short texts. Recent studies (Rousseau, Kiagias, & Vazirgiannis, 2015; Wang, Song, Li, Zhang, & Han, 2016) have employed graphical representations of text and extracted path-based features for text classification. Despite their initial success in formal texts, these approaches often fail to deliver satisfactory performance due to the inadequacy of short text features. In order to address this issue, many domestic and international researchers employ external corpora or leverage associated internal semantic information to enhance the features of short texts. For instance, Phan, Nguyen, and Horiguchi (2008) harnessed external corpora to extract latent themes from short texts. Wang, Chen, Jia, and Zhou (2013) introduced external entity information from the Wikipedia knowledge base to represent text. Yao, Bi, Huang, and Zhu (2015) enriched short text with semantic similarity information. However, these model architectures are relatively straightforward and have failed to fully unearth the latent characteristics of short texts, hence yielding limited classification efficacy.

2.2. Deep neural networks for short text classification

In recent years, with the continuous advancement of deep learning, text classification based on deep learning techniques has gradually emerged as the prevailing trend in natural language processing tasks (Wang, Wang, Zhang, & Yan, 2017). The most prominent advantage of deep learning methods over traditional text classification approaches lies in their efficient handling of text representation issues, enabling a more precise capture of textual features and achieving end-to-end problem resolution. Current research explores various methodologies grounded in deep learning principles, including models based on long short-term memory networks (LSTM), recurrent neural networks (RNN), and convolutional neural networks (CNN). For instance, Wang et al. (2019) introduced a bidirectional RNN model enriched with an attention mechanism for short text classification. This model finds applications in health monitoring and the automated filtration of health-related tweets. RNNs can effectively capture bidirectional information in sequence data, but they are prone to vanishing and exploding gradients during training. LSTMs handle this problem better. Li et al. (2022) devised a versatile distributed LSTM network that accommodates large-scale, high-velocity short text streams. However, it may struggle to capture complex patterns in lengthy sequences. In addition, because CNNs can effectively capture local features and patterns, they also perform well in local feature extraction for tasks such as text classification. Zhou et al. (2022) ingeniously devised a multichannel convolution framework based on CNN, thereby generating feature maps of diverse scales and facilitating the capture of semantic features spanning various dimensions. However, CNNs struggle to effectively capture long-range dependencies in lengthy text sequences due to their inherent local perception mechanism and fixed window size. The introduction of the Transformer architecture has effectively mitigated this issue. BERT, through its bidirectional encoding mechanism, comprehensively parses input text, capturing both local and global information, thus delving deeper into understanding the contextual relationships within the text. Cui, Wang, and Yu (2023) used a fusion model combining BERT and TextRNN. The BERT model uses a deep bidirectional Transformer component to build the entire model, ultimately generating a deep bidirectional language representation that integrates context from both directions. However, the BERT model places certain restrictions on the length of the input text, usually requiring truncation or padding, which may result in the loss or redundancy of text information and affect the performance of the model. Moreover, it cannot establish relationships across texts. Multi-stage attention models can compute a weighted average over the importance of different positions, effectively handling variable-length sequences and capturing long-distance dependencies. Meanwhile, Liu, Li, and Hu (2022) introduced a multi-stage attention model amalgamating TCN and CNN, enhancing model parallelism and overall efficiency. These innovative approaches have yielded commendable results in a multitude of NLP tasks. Nonetheless, problems remain, such as loss of useful information, relative complexity, and high computational cost.

2.3. Graph neural networks for short text classification

Short text classification involves categorizing concise content using machine learning and data mining. Unlike extended text classification, it is more challenging due to length constraints. Short texts lack significant contextual details and strict syntactic structures, which are crucial for comprehensive text understanding (Wang et al., 2017). Therefore, methods customized for short text classification strive to integrate various auxiliary information to enrich short text representation.


The continuous development of graph neural networks (GNNs) has achieved state-of-the-art performance on short text classification. Here, we introduce the short text classification models based on graph neural networks proposed in recent years. First, Defferrard, Bresson, and Vandergheynst (2016) proposed the usage of convolutional neural networks (CNNs) on graphs by treating text data as graph structures and applying local spectral filtering techniques. This approach significantly reduces computational complexity and has achieved notable results in text classification tasks. Subsequently, based on graph neural networks, Yao, Mao, and Luo (2019) designed a short text classification method that models words and texts as nodes in a graph, formulating text classification as a node classification problem. This approach can comprehensively integrate global information among texts, enhancing the understanding of text semantics and context. It adapts well to unstructured and irregular text data and is one of the earliest papers to propose this method. However, it only utilizes word semantic similarity information to enrich document representation, which is not enough for sparse short texts. Furthermore, Ye, Jiang, Liu, Li, and Yuan (2020) found that the semantic information of word node representation and word order is very useful in short text classification. They developed a short text graph convolutional network (STGCN) based on words, document relationships, and text topic information, and merged the node representations with word embeddings obtained from pre-trained BERT. Yang et al. (2021) noticed the importance of attention mechanisms and proposed HGAT based on a double-layer attention mechanism, supplemented by additional relations and external knowledge bases to classify short texts. External knowledge bases are very helpful for short text classification because they can provide more features and initial knowledge for short texts, thereby elevating the precision of the model. However, it should be noted that using external knowledge bases can also cause noise interference. Recently, Jin, Sun, and Ma (2022) developed a concise method for short text classification using a dual-channel hypergraph convolutional network. This approach effectively learns two different representations of short text features. It enhances text embedding through an attention network, improving computational efficiency. Wu (2023) proposed a new heterogeneous graph attention network based on HGAT. The prior knowledge introduced in the HIN enhances the semantic representation of short texts. Hua et al. (2024) integrated heterogeneous graph convolutional neural networks of text, entities, and words, represented features through word graphs, enhanced word features through BiLSTM, and predicted document categories. However, due to length constraints, GNNs, when dealing with short texts, typically do not consider adding additional information. Instead, they treat each short text as a single node in the graph, resulting in insufficient information features and poor performance. Despite most of these methods utilizing graphs to model texts, they neglect the influence of graph structure on short text attribute relationships. They also overlook the relevance of overall content features when dealing with feature attributes and relationships between texts, resulting in inadequate connections between nodes. To address the challenges of short text classification, we employ multiple features as nodes, enhance edges to enrich their relationships, and finally utilize efficient exploration and aggregation mechanisms.

Unlike the above existing studies, in this article, we address the issue of feature sparsity in short text classification by constructing a heterogeneous graph for short text corpora and using edge enhancement methods to rebuild relationships between nodes and enhance edge structures, thereby obtaining higher-order relationships. We propose a novel EMGAN model for classification that fully explores the structure of the heterogeneous graphs, further aggregates the adequate information of distant neighbor nodes into the attention mechanism, and dynamically extracts the critical characteristics of short texts instead of directly processing the entirety of the information.

3. Proposed method

3.1. The design of the EMGAN structure

In this section, we detail the design of the EMGAN structure. Fig. 2 visually shows the architecture of the EMGAN model proposed in this paper. Our model includes three key stages: (1) Heterogeneous information graph construction: In order to better represent the features of short texts, we utilize a heterogeneous information graph. Specifically, we first use a part-of-speech tagger to mark the part of speech (POS) of each word in the short text and then use different attributes (document, entity, topic, keyword) to model the short text as nodes and construct edge relationships to form heterogeneous information graphs. (2) Graph edge enhancement: We propose an edge enhancement method based on the 𝑋-shaped structure, which reconstructs the edge structure between nodes, enriches edge relationships, enhances the connectivity of the global topology of heterogeneous graphs, and forms multi-dimensional complex network relationships. (3) Minimum margin graph attention mechanism: We design a novel model (MMGAN) that uses a minimum margin graph attention mechanism to embed heterogeneous information graphs for short text classification. MMGAN can utilize information propagation along the graph to explore the structure of heterogeneous graphs at the lowest cost, fully utilize the characteristics of various types of nodes to integrate short text features, address the issue of sparse features in short texts, and attain superior outcomes in the classification of such texts.

3.2. Heterogeneous information graph for short texts

Short texts suffer from problems such as limited data and sparse features. In the case of such discontinuous vocabulary, modeling text as a graph structure helps nodes learn feature information from one another. It transforms text classification tasks into graph classification tasks. Inspired by the model SHINE (Wang, Wang, et al., 2021), but different from it, we also introduce document and topic features to represent short texts. Specifically, our graph construction first uses part-of-speech taggers to reduce errors caused by ambiguity, then takes document features as central nodes, uses multiple attributes such as topics, entities, and keywords as nodes to compensate for missing features, and flexibly builds relationships between nodes. Specifically, entities serve as the subjects of events and can encapsulate rich information. The majority of entities possess information intrinsically tied to their respective domains. For example, the entity ‘‘Microsoft’’ often appears in the technology field. Topics represent the primary subjects or themes of discussion, providing insights into the central focus of the text. Keywords, important terms, or phrases highlight key information and aid in summarization and indexing. These elements enrich the representation of short text features, provide contextual understanding, and improve the differentiation between texts with similar features, thus compensating for feature deficiencies. They address the limitations of short text features by capturing different aspects of content and flexibly integrating rich relationships. The specific construction process of the heterogeneous information graph is shown in Fig. 3.

Here, we consider constructing a heterogeneous information graph G = (V, B) consisting of documents, entities, keywords, and topic nodes, where V = {v_1, …, v_n} and B = {b_1, …, b_m} represent the sets of nodes and edges respectively, n is the number of nodes, and m is the number of edges. In the node set of graph G, documents, entities, keywords, and topic nodes are represented by D = {d_1, …, d_a}, E = {e_1, …, e_y}, W = {w_1, …, w_r} and T = {t_1, …, t_g} respectively. In the short text, entities, keywords, and topic nodes are all connected to the document node (central node). The features of the central node are obtained by encoding the short text through the RoBERTa model. The construction of other nodes and edges is described in detail below.
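Before turning to those details, the following minimal sketch (not the authors' released implementation; the networkx representation, node keys, and helper names are our own assumptions) shows how such an HIG skeleton could be assembled, with document nodes as centers and entity, topic, and keyword nodes attached to them:

import networkx as nx

def build_hig(documents, doc_features, entities, topics, keywords):
    """documents: list of doc ids; doc_features: dict doc_id -> RoBERTa vector;
    entities/topics/keywords: dict doc_id -> list of attribute strings."""
    G = nx.Graph()
    for d in documents:
        # document nodes act as the central nodes of the HIG
        G.add_node(("doc", d), ntype="document", feat=doc_features[d])
    for d in documents:
        # connect each document to its attribute nodes (entities, topics, keywords)
        for e in entities.get(d, []):
            G.add_node(("entity", e), ntype="entity")
            G.add_edge(("doc", d), ("entity", e))
        for t in topics.get(d, []):
            G.add_node(("topic", t), ntype="topic")
            G.add_edge(("doc", d), ("topic", t))
        for w in keywords.get(d, []):
            G.add_node(("keyword", w), ntype="keyword")
            G.add_edge(("doc", d), ("keyword", w))
    return G

Edge weights such as the TF-IDF and PMI values introduced in the following subsections can then be attached to these edges as attributes.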


Fig. 3. Illustration of heterogeneous information graph for short text. (a) Use NLTK’s to tag the words in the short text. (b) Use RoBERTa to represent the short text. (c) Extract
and represent entities, topics, and keywords in short texts. (d) Construct the relationship between each feature to form a heterogeneous information graph.

First, in short texts, different parts of speech can create ambiguity. For instance, ‘‘uniform’’ can be categorized as ‘‘clothing’’ if used as a noun. However, it does not belong to the ‘‘clothing’’ category if used as a verb. In order to eliminate ambiguity, we use a POS tagger to assign POS tags to the words in short texts, which are syntactic affixes such as nouns and verbs that mark each word in the short text. In particular, we utilize NLTK's default part-of-speech tagger to obtain the part-of-speech tags of each word in the document, resulting in a set of part-of-speech tag nodes V′ = {v′_1, …, v′_n}. We splice entities, keywords, and topics with corresponding parts of speech to eliminate semantic ambiguity.

Second, entities E in document D need to be identified to establish richer edge relationships. Compared to the many keywords and topics in the document, the quantity of entities is considerably smaller, as most short documents encompass a single entity. We chose the entity-linking tool TAGME, which performs well in short texts. Using TAGME to link entities to Wikipedia, we obtain a set of entity nodes E = {e_1, …, e_y}. If a document contains entities, we establish edges between the document and the entities. We then leverage the classic text embedding model word2vec to learn entity embeddings and measure the cosine similarity between each pair of entities across all short texts. We predefine a threshold δ, and when the similarity is greater than δ, we build edges between them and merge the two entity nodes' information.

Third, we employed the LDA (Blei, Ng, & Jordan, 2003) topic model to extract latent topics T, as shown in Eq. (1). Topic modeling is a statistical model that clusters data based on latent semantic meaning. This can help us enrich semantic relationships, especially by identifying latent words within documents or finding connections between similar documents without common words. The topic t_i = {ϵ_1, …, ϵ_z} (where z represents the size of the lexicon) constitutes a conditional probability distribution across a collection of words. In order to avoid the interference of noise, we choose the foremost F words with the highest probabilities as the topic words and allocate the document to these words of elevated likelihood. When assigning documents to topics, we establish edge relationships between documents and topics. For document d, the class label t_d can be predicted as the topic with the highest probability:

P(w|d) = Σ_t P(w|t) ∗ P(t|d),    (1)

t_d = arg max_i P(w_i|d).    (2)
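To make Eqs. (1)–(2) concrete, the following small numpy sketch (illustrative only; the variable names are ours, not the paper's) scores words for a document from the topic–word distribution P(w|t) and the document–topic distribution P(t|d) produced by any LDA implementation, and assigns the document to its highest-probability topic as described above:

import numpy as np

def assign_topic(phi, theta_d, top_f=10):
    """phi: (num_topics, vocab_size) matrix of P(w|t); theta_d: (num_topics,) vector of P(t|d)."""
    p_w_given_d = theta_d @ phi                           # Eq. (1): P(w|d) = sum_t P(w|t) * P(t|d)
    topic_words = np.argsort(p_w_given_d)[::-1][:top_f]   # keep the F most probable topic words
    t_d = int(np.argmax(theta_d))                         # assign d to its highest-probability topic
    return t_d, topic_words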
Fourth, we extract keywords from the tagged short texts to form a set of keyword nodes W = {w_1, …, w_r}. We establish edges based on the inclusion relationship between documents and keywords. To extract keywords, we employ the term frequency–inverse document frequency (TF-IDF) computation technique, wherein the term frequency denotes the frequency of a word's occurrence within a document. In contrast, inverse document frequency represents the logarithmically scaled reciprocal fraction. To establish edges between keywords that have co-occurrence relationships, we employ pointwise mutual information (PMI) to compute the weighting factor between two keywords. When PMI is positive, the keywords correlate more in the corpus, and edges are created between keywords with positive PMI values. Formally, the weight of the edge between node i and node j is defined as:

H_ij =  PMI(i, j),    if i and j are keywords;
        TF-IDF_ij,    if i is a document and j is a keyword;
        1,            if i = j;
        0,            otherwise.    (3)

The calculation method for the PMI value is as follows:

PMI(i, j) = log( p(i, j) / (p(i) p(j)) ),    (4)

p(i, j) = Λ(i, j) / Λ,    (5)

p(i) = Λ(i) / Λ,    (6)

where Λ(i, j) represents the number of sliding windows that contain both word i and word j, Λ(i) signifies the count of sliding windows containing word i, and Λ denotes the total number of sliding windows contained in the entire corpus.

By using multiple attributes such as topic, entity, document, and keyword as nodes and specific relationships as edges to construct a heterogeneous graph, more abundant feature information can be obtained, thereby compensating for the semantic shortcomings of short texts and playing an important role in subsequent classification tasks (see Fig. 4).
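A rough sketch of how the PMI weights of Eqs. (4)–(6) could be computed with a sliding window over the tagged short texts is given below; the helper is hypothetical and omits the TF-IDF document–keyword weights of Eq. (3):

import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs_tokens, window=5):
    """docs_tokens: list of token lists. Returns {(w_i, w_j): PMI} for positive-PMI pairs."""
    win_count = 0
    single, pair = Counter(), Counter()
    for tokens in docs_tokens:
        for s in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[s:s + window])
            win_count += 1                      # Lambda: total number of sliding windows
            single.update(win)                  # Lambda(i): windows containing word i
            pair.update(frozenset(p) for p in combinations(sorted(win), 2))
    edges = {}
    for p, n_ij in pair.items():
        i, j = tuple(p)
        p_ij = n_ij / win_count                               # Eq. (5)
        p_i, p_j = single[i] / win_count, single[j] / win_count   # Eq. (6)
        pmi = math.log(p_ij / (p_i * p_j))                    # Eq. (4)
        if pmi > 0:                             # only positive-PMI pairs become keyword edges
            edges[(i, j)] = pmi
    return edges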


Fig. 4. Illustration of the edge enhancement method for heterogeneous information graph. (a) Identify the 𝑋-shaped structure in the heterogeneous information graph with the
central node as the unit. (b) To measure the similarity between 𝑋-shaped structures, add an edge to the central node between similar structures.

3.3. 𝑋-Shaped structure graph edge enhancement method

In the realm of heterogeneous graphs, we amalgamate a plethora of attributes as nodes to enrich the informational fabric of the graph. Nevertheless, the brevity and sparsity of features in short textual data render such efforts insufficient. Furthermore, most heterogeneous graphs exclusively contemplate the characteristics of low-level neighboring nodes, thus failing to augment higher-level information. Therefore, we propose an edge enhancement method for heterogeneous graphs based on an 𝑋-shaped structure. The enhancement process is illustrated in Fig. 4. The core idea is as follows. Initially, within the constructed heterogeneous graph, we provide the following definition: an 𝑋-shaped structure is a substructure of a heterogeneous graph with a central node that connects to at least four nodes of at least three different types (topics, entities, and keywords), forming a structure resembling the letter ‘‘𝑋’’. The purpose of the 𝑋-shaped structure is to establish edge relationships between two different 𝑋-shaped structures when they share connections of the same node type. This maximizes the connectivity of nodes with feature relevance between the two central nodes. At this point, the information of the two central nodes can complement each other as features, and the original set of edges is merged with the new set to form a new total edge set. This structure facilitates the flow of feature information from nodes of other 𝑋-shaped structures toward its central node, aiding in capturing the multiple associations between short texts, topics, entities, and keywords. It enhances the connectivity of the heterogeneous graph's topological structure, enriching the feature information on the graph. This is vital for a better comprehension of the contextual and feature aspects of the text in classification tasks, ultimately leading to improved accuracy and effectiveness. This preparation sets the stage for capturing higher-order feature information for the model to be introduced in the following section. We will now explore how to employ edge enhancement techniques to enhance graph construction.

We first perform high-order encoding connections by constructing an 𝑋-shaped adjacency matrix A^X, where (A^X)_ij is the number of 𝑋-shaped structure instances containing nodes i and j. The network diagram is represented as follows:

G^X = {V′, B^X},    (7)

where G^X represents a heterogeneous graph based on the 𝑋-shaped structure, V′ represents the same set of nodes as the original heterogeneous graph, and B^X is the weighted edge set generated based on the 𝑋-shaped structure:

B^X = {(k, l, η)_i | i ∈ {1, …, m_x}},    (8)

where k, l ∈ V′ are the two endpoints of the i-th (i ∈ {1, …, m_x}) edge and η represents the weight.

Next, we identify the connected structures based on the heterogeneous graph above, and any set of 𝑋-shaped connected structures is represented as follows:

Φ = {ϕ_i},    (9)

ϕ_i = {V′^{ϕ_i}, B_i^X},    (10)

V′^{ϕ_i} = {t_i, e_i, w_i},    (11)

where Φ represents the total node set, ϕ_i (i ∈ {1, …, m_Φ}) represents the set of nodes and edges of the ith structure, V′^{ϕ_i} ⊆ V′ is the set of nodes in the ith connected structure, which includes entities e_i, topics t_i, and words w_i, and contains at least four nodes of three different types, and B_i^X ⊆ B^X is the weighted edge set of the ith connected component.

The nodes that make up the 𝑋-shaped connected component represent feature attributes in the document and are connected by edges. This connected structure is a stable 𝑋-shaped connected component that can better supplement feature information and possess higher-order structural capabilities.

Next, we shall refer to the set of nodes in the jth 𝑋-shaped structure as V′^{ϕ_j} = {t_j, e_j, w_j}. If ∀(t_i, e_i, w_i) ∈ V′^{ϕ_i}, we establish an edge relationship between the two structures, allowing the feature information of the ith and jth structures to complement each other. We denote this set of edges as follows:

B′^X = {(k̄, l̄) | ∀k̄, l̄ ∈ X_j, ∀j = 1, …, m_Φ},    (12)

where k̄, l̄ ∈ V′ represent the two endpoints of an edge between the ith and jth (j ∈ {1, …, m_Φ}) central nodes in 𝑋-shaped structures.

We divide the 𝑋-shaped connected structures ϕ_Q ∈ Φ that have reconstructed edge relationships into the same modules. We choose Louvain (Blondel, Guillaume, Lambiotte, et al., 2008) as the module division method, and the input is each 𝑋-shaped connected structure ϕ_Q. We represent the modularization S (Newman & Girvan, 2004) as:

S = (1 / 4λ) Σ_{ij} (A_ij − γ_i γ_j / 2λ)(∂ + 1) = (1 / 4λ) Σ_{ij} ∂ (A_ij − γ_i γ_j / 2λ),    (13)

where λ = (1/2) Σ_i γ_i represents the total number of edges in the network, γ_i and γ_j denote the degrees of the ith and jth structural hub nodes, and γ_i γ_j / 2λ signifies the expected number of connections between these two structures. ∂ = ∀[(e, t, w) ∩ (e ∪ t ∪ w)] / Σ_{V′>4}(e, t, w) represents the probability that two 𝑋-shaped structures share a common attribute; if ∂ ≥ 1, the two structures can be linked and belong to the same module; if ∂ < 1, then they do not belong to the same module. A_ij is the element of the adjacency matrix between central nodes i and j (the number of connecting edges between central nodes i and j).

The output is a module S composed of several connected structures ϕ_Q ∈ Φ; putting all modules together, we obtain a module set, which we denote as {S_1, …, S_s̄}, where s̄ is the number of all modules obtained by the fusion of 𝑋-shaped connected components.
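The following simplified sketch, assuming the networkx graph from Section 3.2, identifies 𝑋-shaped structures and links the central nodes of structures that share attribute nodes; it follows the textual description only, and the Louvain/modularity grouping of Eq. (13) is omitted:

import networkx as nx

def x_shaped_structures(G):
    # an X-shaped structure: a document node with >= 4 attribute neighbors of >= 3 types
    structures = {}
    for n, data in G.nodes(data=True):
        if data.get("ntype") != "document":
            continue
        nbrs = [m for m in G.neighbors(n)
                if G.nodes[m].get("ntype") in ("entity", "topic", "keyword")]
        types = {G.nodes[m]["ntype"] for m in nbrs}
        if len(nbrs) >= 4 and len(types) >= 3:
            structures[n] = set(nbrs)
    return structures

def enhance_edges(G, structures):
    # link central nodes whose X-shaped structures share at least one attribute node
    centers = list(structures)
    for a in range(len(centers)):
        for b in range(a + 1, len(centers)):
            u, v = centers[a], centers[b]
            if structures[u] & structures[v]:
                G.add_edge(u, v, enhanced=True)   # new central-node edge merged into the edge set
    return G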


Fig. 5. Illustration of the minimum margin graph attention mechanism. (a) Computes the minimum margins from higher-order neighbor nodes (entity, keyword, and topic nodes)
to other central nodes. (b) Calculate the attention coefficient from the high-order neighbor node to the central node according to the minimum margin.

Finally, we perform a reconstruction of the relationships between nodes by strengthening the connectivity structure of each module in the set {S_1, …, S_s̄}, thereby enhancing the edge relationships and supplementing high-order structures with low-order structures to reinforce the topological structure of the graph. We use an 𝑋-shaped structure with better transitivity for connectivity. For nodes in the same module S_i ∈ {S_1, …, S_s̄} (i ∈ {1, …, s̄}), we allow feature information to complement each other, thus constructing a new set of edges, represented as:

B*_mod = B^X ∪ B′^X,    (14)

Based on the new edge set B*_mod, the edge relationships of the original graph structure are strengthened to form a new network connectivity graph, and the documents with high feature correlation are maximally connected, expressed as:

G_mod = {V′, B*_mod},    (15)

3.4. Minimum margin graph attention network

By performing edge enhancement on heterogeneous graphs, the connections between topological nodes in the graph are made complete, which provides a supplement for the problem of sparse feature representation in short texts. However, it is worth considering how to explore this high-order feature information. Most classification models only consider low-order feature information in the network, which cannot capture the high-order features in the graph. Although the graph has rich information, it cannot be captured at a high-order level, resulting in a significant performance loss. For example, in traditional models (Kipf & Welling, 2016), only the information of low-order nodes within a single layer is examined. The most common approach to capturing features of high-order neighbors is to stack multiple layers to expand the field of view. However, experimental results have shown that stacking multiple layers in the GAT model fails to expand the field of view and leads to performance degradation.

Therefore, this paper proposes a minimum margin graph attention network model that captures high-order topological features. The model can perform complete walks and exploration in heterogeneous graphs with high-order topological structures, finding the minimum distance between the central node and other attribute nodes, even for distant nodes, with the minimum cost. Then, the attention coefficients of the node distances and features are calculated for updating. By applying the minimum margin graph attention network, we can explore other nodes in the graph structure to obtain feature information and effectively aggregate this information into the central node, supplementing the short text content and improving the accuracy of short text classification tasks.

The overall process of the minimum margin attention mechanism is shown in Fig. 5. Firstly, we select the document node D in the heterogeneous graph as the center node. For each center node, we compute the minimum margin R of the high-order neighbors (keywords, topics, entities) of other center nodes to the center node with different lengths and extract their connection features as margin features. Then, we utilize the minimum margin attention mechanism to calculate the attention coefficients of these high-order neighbors to the center node. Finally, we iteratively update the features of each document using margin features and attention coefficients, aggregating information from other nodes to the center node.

In addition, the features of each node in our model are only related to the topological structure of the graph. They are independent of the order of the node embedding features and neighboring nodes. During aggregation, the model relies on nodes and explores the minimum distance from the document node. Next, we will provide a detailed explanation of the model.

3.4.1. Minimum margin search and sampling

First, our input consists of the minimum margin R and node features h. In the initial stage, the minimum margin R is computed with uniform edge weights. This is done to minimize the loss of specialized tasks, such as cross-entropy loss in classification tasks. After training, the attention function generates edge weights based on learned attention coefficients.

Then, the minimum margin is calculated using Dijkstra's algorithm (Dijkstra, 2022), where the edge weights are first inverted and then transformed into positive values using the Suurballe method (Sidhu, Nair, & Abdallah, 1991). After computation, different attention coefficients have varying impacts on the edges. To ensure the stability of edge weights, we choose the attention coefficients of the network's last layer and take the average of all attention coefficients:

η_ij = (1/P) Σ_{p=1}^{P} ^f α_ij^{(p)},    (16)

where P represents the number of attention heads in a layer, f denotes the final layer, α_ij^{(p)} refers to the attention coefficient from node i to node j in the pth attention head, and η_ij represents the edge weight from node i to node j.
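A hedged sketch of this search step is shown below: it averages the last-layer attention heads into edge weights as in Eq. (16), inverts them so that strong attention corresponds to low traversal cost, and runs Dijkstra's algorithm from the document node; a small positive shift stands in for the Suurballe transformation described above, and all names are illustrative rather than taken from the released code:

import networkx as nx

def minimum_margins(G, attn, center, c_max=3, eps=1e-6):
    """attn: dict mapping an edge (u, v) to the list of its last-layer attention coefficients (one per head)."""
    H = nx.Graph()
    for (u, v), heads in attn.items():
        eta = sum(heads) / len(heads)               # Eq. (16): average over the P attention heads
        H.add_edge(u, v, cost=1.0 / (eta + eps))    # invert: high attention -> low traversal cost
    sub = nx.ego_graph(H, center, radius=c_max)     # restrict the view to nodes within c_max hops
    return nx.single_source_dijkstra_path_length(sub, center, weight="cost")

The returned distances can then be grouped by hop length c and truncated to the top-p entries per length, as specified by Eqs. (17)–(18) below.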


Let R_ij^c represent the minimum edge distance of length c between nodes i and j, where c is the length of an arbitrary edge, and let ℜ represent the set of such distances. The document nodes themselves are added to the set ℜ. Within an edge distance of length c, we allow the document nodes to access nodes up to c hops away, so that the maximum value of c can be used to control the size of the single-layer visual field.

For edges with the same minimum edge distance, those with higher costs in heuristics are less correlated with the document's features. In comparison, those with lower costs are more correlated. We sample the first p edges for a given central node and use the minimum cost, reducing computational pressure and highlighting the importance of more relevant edge distances. We represent the set of all sampled edge distances as:

ℑ_i^c = top_p(ℜ^c),    (17)

p = φ_i ∗ μ,    (18)

where ℑ_i^c represents the set of all sampled distances with length c centered around node i, φ_i denotes the degree of node i, and p is determined by the degrees of the document nodes, ensuring the comparability of embedded features from distances of varying lengths. ℜ^c signifies the subset of ℜ encompassing all the shortest distances of length c. μ is a hyperparameter, representing the ratio between the number of sampled distances and the degree of document nodes.

3.4.2. Aggregation of margin information

Margin aggregation is the cornerstone of our model. By meticulously exploring minimal margins, we select feature nodes that exhibit the lowest cost in proximity to document nodes while maintaining a high degree of feature relevance. Subsequently, we aggregate the feature information from these diverse nodes into the document nodes. This process enables capturing more intricate topological information by accommodating varying lengths of the shortest margins. Consequently, it augments the features of short texts. To this end, we have devised a dual-layer attention-based margin aggregation mechanism that addresses attention to identical and disparate edge distances. With attention to the same margin, for each document node i and the set of shortest margins ℑ_i^c, we aggregate the features of each shortest margin of length c and represent the aggregated features as:

ζ_i^c = Θ_{p=1}^{P} { Σ_{R_ij^c ∈ ℑ_i^c} α_ij^{(p)} ∫(R_ij^{(p)}) },    (19)

ζ_i^c is the aggregated feature of node i concerning ℑ_i^c, where ℑ_i^c is the set of shortest edge distances of length c centered at node i. The operator Θ represents the concatenation of all intermediate layer connections and the final layer averaging operation, which calculates the mean feature of all nodes in the edge distance. P is the number of attention heads for all edge distances of the same length c, and ∫ maps edge distances of different lengths to a fixed length. α_ij^{(p)} is the attention coefficient between node i and edge distance R_ij^c, which can be expressed as:

α_ij^{(p)} = τ((h⃗′_i, ∫(R_ij^c)) | θ_α) = exp(σ(θ_α, [h⃗′_i ∥ ∫(R_ij^c)])) / Σ_{h⃗′_i ∈ ℑ_{θ_α}} exp(σ(θ_α, [h⃗′_i ∥ ∫(R_ij^c)])),    (20)

where τ represents the attention function, which outputs the attention between node feature h and the minimum edge margin R. h⃗′_i refers to the linearly transformed features of sample node i, while θ_α denotes the parameters of the defined attention function τ. When we set c = 2, the generated attention coefficients are equivalent to the node attention that can be used to update edge weights. σ denotes any non-linear operation, and ∥ represents concatenation. In the first level, ℑ_θ represents the set ℑ_i^c. The above is an aggregation of the same margins from the first layer.

The second layer focuses on variations in margin features of different lengths, utilizing an attention mechanism to capture embedded features of document nodes:

h⃗_i = σ( Σ_{c=2}^{C} β_c ζ_i^c ),    (21)

where, when we set c = 2, the attention coefficient generated at this time is equal to the node attention that can be used to update the edge weight, C is the maximum allowed edge distance, and ζ_i^c is the aggregated feature of node i with edge distance c. β_c is the attention coefficient of ζ_i^c, which we express as:

β_c = τ(h⃗_i, ζ_i^c | θ_β),    (22)

it can be derived from the identical attention mechanism at document node i by the attention function θ_β in this layer, where ℑ_θ represents the collection of aggregate features ℑ_i^c for all nodes i regarding ζ_i^c.
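The two-level aggregation of Eqs. (19)–(22) can be approximated by the following simplified PyTorch sketch (single head, learned scoring functions passed in as callables, and tanh standing in for the generic non-linearity σ; this is an illustration, not the authors' code):

import torch
import torch.nn.functional as F

def aggregate_margins(h_i, margin_feats, score_a, score_b):
    """h_i: (d,) center-node feature; margin_feats: {c: (n_c, d) sampled margin features};
    score_a, score_b: callables returning unnormalized attention scores."""
    per_length = []
    for c in sorted(margin_feats):
        feats = margin_feats[c]
        alpha = F.softmax(score_a(h_i, feats), dim=0)             # Eq. (20): attention within one length c
        per_length.append((alpha.unsqueeze(-1) * feats).sum(0))   # Eq. (19): aggregated feature zeta_i^c
    zetas = torch.stack(per_length)                               # one aggregated vector per length c
    beta = F.softmax(score_b(h_i, zetas), dim=0)                  # Eq. (22): attention across lengths
    return torch.tanh((beta.unsqueeze(-1) * zetas).sum(0))        # Eq. (21): updated document feature

The updated document features can then be passed to a softmax classifier and trained with the cross-entropy plus L2 objective given in Eqs. (23)–(24) below.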
At the initial stage, the entire network is updated iteratively based on h and R, with R generated using equal edge weights. As the network converges, R is regenerated based on the attention of the final layer, which is used for the next iteration.

After going through an f-layer EMGAN, we feed the obtained final embedding J of the short text into a softmax layer for classification. Formally,

Z = softmax(J^(f)),    (23)

During the model training process, we minimize the model's loss using the cross-entropy loss function while employing L2 regularization to prevent model overfitting:

L = − Σ_{i ∈ D_train} Σ_{j=1}^{O} Y_ij · log Z_ij + Υ ‖Ψ‖_2,    (24)

where O is the number of classes, D_train corresponds to the training dataset, Y_ij denotes the corresponding label matrix, Ψ stands for the model parameters, and Υ represents the regularization factor.

4. Experiments

To validate the availability and accuracy of our classification method, we conducted experiments on five real-world datasets and eleven baselines. This section describes the experimental setup, including the benchmark datasets, baseline algorithms, and parameter settings. Then, we compare methods and analyze the results.

4.1. Experimental setup

4.1.1. Datasets

In order to thoroughly assess the efficacy of our approach, we juxtapose it against cutting-edge methods in diverse scenarios. The evaluation encompasses five datasets: TagMyNews, Snippets, Ohsumed, MR, and Twitter. Table 1 provides a detailed depiction of these datasets.

• TagMyNews: This dataset contains 32,600 news articles collected from RSS feeds in English (Vitale, Ferragina, & Scaiella, 2012). The dataset has been filtered to exclude all titles. It includes articles from seven categories: sports, business, US, entertainment, world, health, and sci.
• Snippets [1]: This dataset, published by Phan et al. (2008), consists of search fragments returned by Web search engines, comprising 12,340 short texts divided into eight categories: business, computer, health, sports, culture and arts, education and science, engineering, and politics and society.
• Ohsumed [2]: This dataset is a medical dataset mixed with 7,400 single-label samples and 6,529 multi-label samples (Yao et al., 2019). We only used titles for short text classification, and documents with multiple labels were removed; the dataset covers 23 cardiovascular disease categories.
• MR [3]: This dataset constitutes an English movie review corpus employed for binary sentiment classification (Pang & Lee, 2005). It encompasses two distinct categories, positive and negative sentiments, comprising 5,331 affirmative reviews and an equivalent number of pessimistic reviews, with an average sentence length of 20.
• Twitter [4]: It is a dataset of English tweets designed for binary sentiment classification. It consists of 5,000 positive and 5,000 negative tweets, allowing for the evaluation of our model's classification capability on social media.

[1] Snippets and TagMyNews are downloaded from http://acube.di.unipi.it:80/tmn-dataset/
[2] http://disi.unitn.it/moschitti/corpora.htm
[3] http://www.cs.cornell.edu/people/pabo/movie-review-data/
[4] http://www.nltk.org/howto/twitter.html#corpus_reader


Table 1
Summary statistics of datasets.
Dataset #Docs #Classes #Avg.Length #Words #Docs with entities #nodes #edges #X-structures
TagMyNews 32,549 7 5.1 38,629 86% 64,557 425,391 26,853
Snippets 12,340 8 14.5 29,040 94% 57,105 371,538 10,389
Ohsumed 7,400 23 6.8 11,764 96% 33,992 214,165 6,864
MR 10,662 2 7.6 18,764 76% 35,853 264,754 8,683
Twitter 10,000 2 3.5 21,065 65% 49,547 325,186 7,129

Table 2
Test accuracy (ACC) and Macro-F1(F1) of different models on five standard datasets. The best results are highlighted in bold.
Model            Metric  TagMyNews  Snippets  Ohsumed  MR     Twitter
CNN-rand         ACC     28.76      48.34     35.25    54.85  52.58
CNN-rand         F1      15.82      42.12     13.95    51.23  51.91
CNN-pretrain     ACC     57.12      77.09     32.92    58.32  56.34
CNN-pretrain     F1      45.37      69.28     12.06    57.99  55.86
LSTM-rand        ACC     25.89      30.74     23.30    53.13  54.81
LSTM-rand        F1      17.01      25.04     5.20     52.98  53.85
LSTM-pretrain    ACC     53.96      75.07     29.05    59.73  58.20
LSTM-pretrain    F1      42.14      67.31     5.09     59.19  58.16
TextGCN          ACC     54.28      77.82     41.56    59.12  60.15
TextGCN          F1      46.01      71.95     27.43    58.98  59.82
HGAT             ACC     61.72      82.36     42.68    62.75  63.21
HGAT             F1      53.81      74.44     24.82    62.36  62.48
STGCN            ACC     34.74      70.01     33.91    58.18  64.33
STGCN            F1      34.01      69.93     27.22    58.11  64.29
SHINE            ACC     62.50      82.39     45.57    64.58  72.54
SHINE            F1      56.21      81.62     30.98    63.89  72.19
STHCN            ACC     63.44      83.45     46.07    64.81  73.24
STHCN            F1      56.28      78.17     31.28    64.44  73.01
ST-Text-GCN      ACC     65.43      85.78     46.42    68.44  75.23
ST-Text-GCN      F1      58.72      80.63     32.14    66.54  74.36
Bert+TextRNN     ACC     69.08      87.54     40.69    61.73  68.44
Bert+TextRNN     F1      61.53      84.31     19.37    60.42  66.95
WC-HGCN          ACC     67.26      86.33     47.72    68.93  75.63
WC-HGCN          F1      60.19      82.20     35.59    67.35  75.81
EMGAN (ours)     ACC     70.13      88.06     50.85    71.04  77.82
EMGAN (ours)     F1      65.68      84.52     37.79    70.38  76.64
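For reference, the two metrics reported in Tables 2 and 3 can be computed with scikit-learn as follows (a generic snippet, not tied to the authors' evaluation scripts):

from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    return 100 * acc, 100 * macro_f1   # Tables 2 and 3 report percentages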

In our experiment, we preprocess all datasets, encompassing the filtration of special characters, segmentation, elimination of stop words, and removal of low-frequency words occurring fewer than five times. Table 1 presents comprehensive information about the datasets, encompassing document count, category quantity, average sentence length, word count, and the proportion of documents containing entities. In our datasets, most texts (approximately 80%) incorporate entities. Regarding the MR dataset, we refrained from word deletion after performing data cleansing due to the brevity of its sentences.

Regarding dataset allocation, we randomly sampled 40 labeled short-text documents per class: half were used for training, and the other half for validation during parameter tuning. In addition, we randomly sampled 1,000 unlabeled documents for training, from which the HIG is generated; most of these texts contain entity attributes. Using two word representation models, Word2vec and TF-IDF, we performed part-of-speech tagging on the short-text words, extracted entity, topic, and keyword attributes from the corpus, and established edge relationships based on rules to form the short-text heterogeneous graph.
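To make the construction step above concrete, the following is a minimal sketch of how such a short-text heterogeneous information graph could be assembled. It is an illustration under stated assumptions rather than our released implementation: the pre-extracted attribute dictionaries, node naming scheme, and edge rules are hypothetical placeholders, and only the document–entity/keyword/topic edges and the entity-similarity threshold follow the description in the text.

import networkx as nx

def build_hig(doc_entities, doc_keywords, doc_topics, entity_sim, delta=0.5):
    """Sketch: assemble a heterogeneous information graph (HIG) from pre-extracted
    attributes. doc_entities/doc_keywords/doc_topics map a document id to its
    entities, keywords, and top-F LDA topics; entity_sim maps entity pairs to a
    similarity score. The edge rules here are illustrative, not the exact ones used."""
    g = nx.Graph()
    for d in doc_entities:
        g.add_node(("doc", d), ntype="document")
        for e in doc_entities[d]:                        # document-entity edges
            g.add_node(("ent", e), ntype="entity")
            g.add_edge(("doc", d), ("ent", e), etype="doc-entity")
        for w in doc_keywords.get(d, []):                # document-keyword edges
            g.add_node(("kw", w), ntype="keyword")
            g.add_edge(("doc", d), ("kw", w), etype="doc-keyword")
        for t in doc_topics.get(d, []):                  # document-topic edges (top-F topics)
            g.add_node(("top", t), ntype="topic")
            g.add_edge(("doc", d), ("top", t), etype="doc-topic")
    for (e1, e2), s in entity_sim.items():               # entity-entity edges above threshold
        if s > delta:
            g.add_edge(("ent", e1), ("ent", e2), etype="entity-entity")
    return g

# toy usage with made-up attributes
hig = build_hig(
    doc_entities={0: ["NASA"], 1: ["NASA", "Mars"]},
    doc_keywords={0: ["launch"], 1: ["rover", "launch"]},
    doc_topics={0: [3, 7], 1: [3, 12]},
    entity_sim={("NASA", "Mars"): 0.62},
)

In the configuration reported in Section 4.1.3 below, the document-topic edges use the top F = 2 LDA topics and the entity-entity edges use the similarity threshold δ = 0.5.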
4.1.2. Baselines

To comprehensively evaluate the performance of our proposed short text classification method, we compared it with eleven baseline methods, as detailed below:

• CNN: Kim (2014) proposed the renowned convolutional neural network (CNN) in deep learning. Our experiments utilized two CNN variations: CNN-rand with random word embeddings and CNN-pretrain with pre-trained embeddings.
• LSTM (Liu, Qiu, & Huang, 2016): The model excels at handling sequential data and utilizes the last hidden state to represent the entire text, making it widely applicable for tasks involving textual data processing.
• TextGCN⁵: TextGCN (Yao et al., 2019) applies graph convolutional networks to represent a text corpus as a graph, capturing informative features by treating words as nodes. This method transforms text classification into node classification.
• HGAT⁶ (Yang et al., 2021): The Heterogeneous Graph Attention Network models entities, topics, and document corpora by embedding a HIN and is employed for short text classification based on a dual attention mechanism.
• STGCN⁷ (Ye et al., 2020): The model represents words, topics, and documents in a corpus as a graph, combines the node representations obtained through Bi-LSTM and Bert word embeddings, and feeds them directly into a softmax layer for classification.
• SHINE⁸ (Wang, Wang, et al., 2021): SHINE models the corpus as a layered heterogeneous graph composed of word-level components, incorporates rich feature information, dynamically learns graph representations of short documents, and facilitates effective propagation of similar short text labels.
• STHCN (Jin et al., 2022): STHCN devised a short text classification method utilizing a dual-channel hypergraph convolutional network. This approach learns two distinct representations of short text features and combines them using an attention network to enhance the embedding of short text.

⁵ https://github.com/yao8839836/text_gcn
⁶ https://github.com/ytc272098215/HGAT
⁷ https://github.com/yzhihao/STGCN
⁸ https://github.com/tata1661/SHINE-EMNLP21


Table 3
Test accuracy and F1 scores of different models with the X-shaped structure edge-enhancement method.
Model            Metric   TagMyNews      Snippets       Ohsumed        MR             Twitter
HGAT-X           ACC      62.45(+0.73)   83.11(+0.75)   43.35(+0.67)   63.47(+0.72)   63.88(+0.67)
HGAT-X           F1       54.42(+0.61)   75.08(+0.64)   25.40(+0.58)   63.05(+0.69)   63.09(+0.73)
STGCN-X          ACC      35.15(+0.41)   70.50(+0.49)   34.28(+0.37)   58.54(+0.36)   64.75(+0.42)
STGCN-X          F1       34.39(+0.38)   70.28(+0.35)   27.51(+0.29)   58.45(+0.34)   64.67(+0.38)
SHINE-X          ACC      62.71(+0.21)   82.63(+0.24)   45.74(+0.17)   64.81(+0.23)   72.79(+0.25)
SHINE-X          F1       56.36(+0.15)   81.79(+0.17)   30.87(+0.11)   64.08(+0.19)   72.40(+0.21)
ST-Text-GCN-X    ACC      65.84(+0.41)   86.21(+0.43)   46.76(+0.34)   68.74(+0.30)   75.68(+0.45)
ST-Text-GCN-X    F1       59.03(+0.31)   81.02(+0.39)   32.43(+0.29)   66.82(+0.28)   74.78(+0.42)
WC-HGCN-X        ACC      67.89(+0.63)   87.00(+0.67)   48.30(+0.58)   69.54(+0.61)   76.22(+0.59)
WC-HGCN-X        F1       60.78(+0.59)   82.81(+0.61)   36.11(+0.52)   67.92(+0.57)   75.35(+0.54)
EMGAN (ours)     ACC      70.13          88.06          50.85          71.04          77.82
EMGAN (ours)     F1       65.68          84.52          37.79          70.38          76.64

• ST-Text-GCN⁹ (Cui, Wang, Li, & Welsch, 2022): The model utilizes self-training on text data, incorporating keywords into the training dataset. The tagged information propagates along the structure of the manifold to the target samples.
• Bert+TextRNN (Cui et al., 2023): This method uses a fusion model that combines Bert and TextRNN to generate a deep bidirectional language representation that integrates context from both directions.
• WC-HGCN (Yang, Liu, Zhang, & Zhu, 2023): It introduces concept information about words to enhance the feature representation of short texts and constructs a text-level heterogeneous graph for each sentence by using words and relevant concepts as nodes, updating the nodes through the designed strategy.

⁹ https://github.com/wanggangkun/ST-Text-GCN

For all the baseline methods mentioned above, we first preprocess our dataset and run the source code provided by the authors. Some results are taken directly from previous research papers (including some baseline figures reported by HGAT and SHINE), while the remaining results are obtained by running the source code. Entity information is obtained from Wikipedia; for example, the CNN and LSTM deep neural networks use entity embeddings trained on the same Wikipedia corpus. TextGCN, HGAT, STGCN, SHINE, STHCN, ST-Text-GCN, and WC-HGCN capture feature information by constructing graphs for better classification performance. We select these baseline methods to enable thorough comparisons.

4.1.3. Parameter settings

Our approach has been validated by selecting optimal parameter values for g, T, and δ to achieve the best performance. For constructing the heterogeneous graph, we set the similarity threshold δ between entities to 0.5 for all datasets, select the top F = 2 words with the highest probabilities as the topic words, and assign documents to these high-probability words. In the LDA topic model, we set the number of topics to g = 20 for the Snippets dataset, g = 15 for the TagMyNews, MR, and Twitter datasets, and g = 40 for the Ohsumed dataset. We implement EMGAN in PyTorch and use the Louvain partitioning method. For all datasets, we set μ to 1.0, signifying that the number of sampled margins equals the degree of each node. By defining the maximum value of c, we can control the size of the single-layer receptive field; we set the maximum value of c to 3 in the first layer and 2 in the second layer. The learning rate is set to 0.005, the dropout rate to 0.5, and the number of iteration steps is fixed at 8. We employ the Adam optimizer for training, and if the validation cross-entropy loss does not decrease for 10 consecutive epochs, training is halted. The experiments use two evaluation metrics, accuracy and F1 score, to measure short-text classification performance. All methods were executed on a computer with an i7-9700KF CPU and an RTX3090 GPU.
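As a concrete illustration of the training setup described above, a minimal PyTorch-style sketch of the optimizer, early stopping, and evaluation metrics might look as follows. It is a sketch under stated assumptions, not the exact EMGAN training script: model, graph, y, train_idx, and val_idx are placeholders, and the dropout of 0.5 is assumed to be applied inside the model.

import torch
from sklearn.metrics import accuracy_score, f1_score

def train_emgan_like(model, graph, y, train_idx, val_idx, max_epochs=200, patience=10):
    """Sketch of the reported setup: Adam with lr=0.005, early stopping on the
    validation cross-entropy (10 epochs), accuracy and Macro-F1 as metrics."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005)   # learning rate 0.005
    criterion = torch.nn.CrossEntropyLoss()
    best_val, bad = float("inf"), 0
    acc = macro_f1 = 0.0
    for epoch in range(max_epochs):
        model.train()                      # dropout (p=0.5) assumed inside the model
        optimizer.zero_grad()
        logits = model(graph)              # forward pass over the heterogeneous graph
        loss = criterion(logits[train_idx], y[train_idx])
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            logits = model(graph)
            val_loss = criterion(logits[val_idx], y[val_idx]).item()
            pred = logits[val_idx].argmax(dim=-1).cpu()
            acc = accuracy_score(y[val_idx].cpu(), pred)                    # accuracy
            macro_f1 = f1_score(y[val_idx].cpu(), pred, average="macro")    # Macro-F1

        if val_loss < best_val:            # halt if validation loss stalls for 10 epochs
            best_val, bad = val_loss, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return acc, macro_f1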
4.2. Experimental results and analysis

In the comparative experiments, to verify the classification performance of the proposed short text classification method, we compared it with CNN, LSTM, TextGCN, HGAT, STHCN, STGCN, SHINE, ST-Text-GCN, and WC-HGCN. Table 2 reports the classification outcomes of the various techniques across five benchmark datasets. Our approach surpasses all baseline methods on all datasets, showcasing the effectiveness and superiority of the proposed method in the domain of short text classification with sparse features.

Upon careful analysis, we observed varied performance among CNN-rand, CNN-pretrain, LSTM-rand, and LSTM-pretrain. While both CNN and LSTM utilize pre-trained word embeddings, CNN excels in capturing contiguous and close-range semantics; therefore, pre-training on the Snippets dataset is more effective for CNN. TextGCN and STGCN, based on graph neural networks, achieve results comparable to the deep models CNN-pretrain and LSTM-pretrain. ST-Text-GCN is an enhancement of the TextGCN model: it augments the training set with self-training, thereby incorporating keywords and leading to significantly higher accuracy. This is attributed to the ability of the text graph to capture both document-word relationships and global word-word relationships. However, when we compare TextGCN with HGAT, the overall accuracy of TextGCN is relatively lower. This is because HGAT incorporates a heterogeneous information network (HIN) structure and attention mechanisms, allowing it to learn the weights of neighboring nodes adaptively, which highlights the superiority of heterogeneous graphs and attention mechanisms. Consequently, STHCN, which also combines attention networks, performs well. SHINE has demonstrated strong performance on numerous datasets, and the analysis suggests that its dynamic learning of short document graphs can facilitate effective label propagation. Bert+TextRNN achieves remarkable performance by leveraging Bert pre-training and the TextRNN model to capture temporal information and long-distance dependencies in the text; especially on the Snippets dataset, it achieves an accuracy of 87.54%, second only to our model. In contrast, WC-HGCN introduces conceptual information about words to enrich the feature representation of short texts, constructing a text-level heterogeneous graph for each sentence; compared to the models above, it achieves superior results. Furthermore, by comparing TextGCN, STGCN, and SHINE, we observe that models based on graph neural networks can achieve excellent results on short texts, indicating that graph structures can extract advanced semantic features from sentences. Meanwhile, a comparison of the performance of HGAT and WC-HGCN demonstrates that incorporating external knowledge can bolster the semantic richness of sentences, effectively addressing the issue of sparsity in short-text features. As a result, our EMGAN model outperforms other state-of-the-art models on the five datasets, with improvements in accuracy of 2.24%, 1.73%, 3.13%, 2.11%, and 2.19%, underscoring the efficacy of our approach. This can be attributed to several factors: (1) We utilize a variety of crucial pieces of information to construct a heterogeneous graph, incorporating external knowledge bases to enrich semantics. (2) We employ an edge enhancement approach based on heterogeneous graphs, enriching inter-node connectivity by restructuring edge structures; this procedure facilitates the acquisition of higher-order relationships within the heterogeneous graph. (3) We introduce a network model based on the Minimum Margin Graph Attention Network, which employs an attention mechanism to comprehensively explore the structure of a heterogeneous graph at minimal cost and aggregates feature information from distant, high-order neighbors, effectively addressing the issue of sparse features in short texts.


Table 4
Test accuracy and F1 scores of different models with the minimum-margin graph attention network.
Model             Metric   TagMyNews      Snippets       Ohsumed        MR             Twitter
HGAT-MM           ACC      62.17(+0.45)   82.83(+0.47)   43.00(+0.32)   63.16(+0.41)   63.65(+0.44)
HGAT-MM           F1       54.22(+0.41)   74.88(+0.44)   25.10(+0.28)   62.72(+0.36)   62.86(+0.38)
STGCN-MM          ACC      35.37(+0.63)   70.69(+0.68)   34.48(+0.57)   58.79(+0.61)   64.97(+0.64)
STGCN-MM          F1       34.60(+0.59)   70.53(+0.60)   27.73(+0.51)   58.67(+0.56)   65.17(+0.58)
STHCN-MM          ACC      64.18(+0.74)   84.23(+0.78)   46.72(+0.65)   65.50(+0.69)   73.96(+0.72)
STHCN-MM          F1       56.95(+0.67)   78.70(+0.73)   31.87(+0.59)   65.06(+0.62)   73.66(+0.65)
ST-Text-GCN-MM    ACC      66.00(+0.57)   86.42(+0.64)   46.91(+0.49)   68.99(+0.55)   75.79(+0.56)
ST-Text-GCN-MM    F1       59.20(+0.48)   81.24(+0.61)   32.56(+0.42)   67.01(+0.47)   75.87(+0.48)
EMGAN (ours)      ACC      70.13          88.06          50.85          71.04          77.82
EMGAN (ours)      F1       65.68          84.52          37.79          70.38          76.64

Table 5
Ablation experiment of EMGAN.
TextGCN   HIG   X-shaped   MMGAN   ACC     F1
✓         –     –          –       77.82   71.95
✓         ✓     –          –       79.13   73.58
✓         –     ✓          –       78.95   72.34
✓         –     –          ✓       80.42   75.15
✓         ✓     ✓          –       84.91   78.38
✓         ✓     ✓          ✓       88.06   84.52

4.3. Ablation study

To verify the impact of the proposed X-shaped structure edge enhancement approach on our method, we conducted the following experiments by applying the X-shaped structure edge enhancement approach to the HGAT, STGCN, SHINE, ST-Text-GCN, and WC-HGCN methods. The experimental outcomes are illustrated in Table 3, and based on these results, we can observe the following.

The X-shaped structure edge enhancement approach can restructure the edge relationships between nodes, thereby enriching the edge connections. This approach significantly improves HGAT, STGCN, SHINE, ST-Text-GCN, and WC-HGCN. Both STGCN and ST-Text-GCN do not take the heterogeneity of nodes into account; to address this, we consider their nodes homogeneous, although this may result in the loss of some feature information. Nevertheless, our X-shaped structure edge enhancement method can establish rich edge relationships, preserving core node features and their interrelationships, such as entities, topics, keywords, and other essential characteristics. However, the performance improvement on SHINE was not significant. Our analysis suggests that this is due to the hierarchical graph construction used in SHINE, where nodes of the same class are present in each layer and have already formed close relationships. Our proposed edge enhancement method shows significant improvements on HGAT and WC-HGCN, especially with HGAT-X achieving a classification accuracy of 83.11% on Snippets. The experimental results demonstrate that our X-shaped structure edge enhancement method effectively addresses the issue of sparse edge relationships in short text, significantly improving model performance and validating the effectiveness of our approach.

To demonstrate the effectiveness of our proposed minimum margin graph attention network, we compared our EMGAN model with four variant models, namely HGAT, STGCN, STHCN, and ST-Text-GCN. The comparison results are presented in Table 4.

In the model for the short text classification task, we designed a minimum margin graph attention network to achieve the purpose of enriching feature information. This mechanism is used here for the first time in short text tasks, and it improves HGAT, STGCN, STHCN, and ST-Text-GCN. Firstly, our model excels on STGCN, STHCN, and ST-Text-GCN, mainly because ST-Text-GCN builds a text graph based on word co-occurrence and document-word relationships; however, this graph only includes word and document nodes, limiting the available information. In the STGCN method, a topic model extracts the short text graph of topic words; the node information in this graph carries only topic information, the node types are missing, and the word-node representation of the short text plays a vital role. STHCN employs dual-channel hypergraph learning to extract two distinct representations of short-text features. Subsequently, we enhance short-text embeddings by utilizing our minimum margin attention network. This integration with our proposed model allows for more effective exploration within the graph, facilitating the capture of additional node information and the enrichment of feature data. Furthermore, HGAT itself already incorporates an attention mechanism, resulting in no significant performance improvements there. In summary, our proposed minimum margin graph attention network can thoroughly explore the structure of heterogeneous graphs at minimal cost and aggregate feature information from distant neighbors.

The above two ablation experiments demonstrate that EMGAN not only incorporates the idea of X-shaped structure edge enhancement but also proposes a minimum margin graph attention network, which further enriches the feature information of short texts and effectively addresses the problem of sparse feature information in short texts, thus improving classification performance. However, we noticed that the F1 value is still relatively low on the Ohsumed dataset, which may be because the original Ohsumed dataset can contain multiple labels per document and its text information is complex. Nevertheless, although our method still has room for improvement on this dataset, EMGAN significantly outperforms all variants.

With respect to the effects of the three mechanisms involved in EMGAN, i.e., the heterogeneous information graph, X-shaped structure enhancement, and minimum margin graph attention network, and some combinations of them, we take TextGCN as the baseline of performance, and the experimental results are shown in Table 5. To facilitate a systematic comparison, we enumerate the results one by one. The table reveals that all mechanisms contribute to the enhancement of TextGCN, with the EMGAN fusion mechanism showing the most pronounced effect, boosting the classification accuracy from 77.82% to 88.06%. The individual application of each mechanism on TextGCN results in respective improvements of 1.31%, 1.13%, and 2.60%. We discover that using the X-shaped enhancement in isolation yields minimal improvements. This can be attributed to TextGCN graph construction being based on word co-occurrence: while we have introduced the X-shaped enhancement in TextGCN to strengthen edge structures, the information type in its graph construction remains singular. This has a certain impact, albeit a relatively minor one.


Fig. 6. The test accuracy with different numbers of labeled documents.

Fig. 7. The average accuracy with different numbers of entities, keywords, topics, top F relevant topics, and similarity threshold δ between entities on the TagMyNews and Twitter datasets.

However, when combined with the other two mechanisms, it can significantly enhance performance. Subsequently, we incrementally introduce mechanisms on an individual basis, achieving 84.91% accuracy when simultaneously employing HIG and the X-shaped method. This is because HIG extracts three types of feature information, enriching the information on the graph, and the X-shaped method then enhances the graph structure by establishing higher-order connections. EMGAN, utilizing all three mechanisms simultaneously, demonstrates the best performance. This underscores the effectiveness of MMGAN, built upon the foundations of HIG and the X-shaped structure: by employing attention to minimize edge distances on high-order heterogeneous graphs, it comprehensively explores their structure and aggregates feature information from distant high-order neighbors, effectively addressing the issue of sparse features in short texts.

4.4. Labeled data

In order to evaluate the impact of labeled data size, we selected four relevant algorithms for testing, including CNN-pretrain, TextGCN, HGAT, and EMGAN. We systematically varied the proportion of annotated documents across the different datasets and assessed the respective test accuracies on the TagMyNews, Snippets, MR, Ohsumed, and Twitter datasets. Each method was executed ten times, and the average performance was computed to yield the results. As shown in Fig. 6, all algorithms performed well on these datasets, with accuracy increasing as the proportion of labeled data increased. TextGCN, HGAT, and EMGAN, which are based on graph convolutional networks, achieved comparatively high accuracy. This indicates that methods based on graph convolutional networks can effectively enhance information propagation, and the X-shaped structure edge augmentation together with the minimum margin graph attention network enables better utilization of limited labeled data. When the proportion of labeled documents provided is relatively small, the performance of the baseline methods decreases significantly; in contrast, our method still achieves relatively high accuracy. There is a noticeable improvement when the proportion of labeled documents is relatively large. This is attributed to our EMGAN method, which connects more nodes to obtain more node feature information, effectively propagating the labeled data and maximizing its utilization to achieve accurate short text classification performance.

4.5. Parameter analysis

This section examines the impact of parameters on our method. Selecting topics, entities, and keywords is crucial for our graph construction method, as it determines semantic capture and algorithm runtime. To verify our hypothesis, we experimented and visualized the results for reference. Fig. 7 shows the test accuracy on the TagMyNews and Twitter datasets for different numbers of topics, top-related topics, entities, and keywords. For the number of topics, accuracy improves as the number of topics increases; however, this trend continues only until 15, after which accuracy decreases as more topics are added. The top F related topics assigned to a specified document work best when F = 2 and show a downward trend when F exceeds 2. We have also experimented with different numbers of entities and keywords. We noticed that as the number of selected entities and keywords grows, the test accuracy initially improves; however, once the count exceeds 5, the accuracy starts to decline. We hypothesize that this may be because the number of entities in a document is inherently much smaller than the number of topics and keywords, and selecting too many keywords can increase the complexity of the heterogeneous information network. This may cause redundant edge relationships between unrelated nodes, making classification more challenging. We set these four parameters in our experiments based on each dataset's validation set. For the three hyperparameters within our model, the sampling rate for margins μ, the depth of margins C, and the number of iterations Iter, we vary each to analyze our model's sensitivity to these factors. As shown in Fig. 8, all these outcomes are derived from the Snippets dataset.


Fig. 8. In the context of the MMGAN model, a sensitivity analysis is conducted regarding the three hyperparameters: margin length C, number of iterations Iter, and sampling ratio μ.

Regarding the sampling rate, we held Iter constant at 8 and C at 2, achieving optimal performance at a ratio of 1.0; this signifies that we employed the same number of paths as the degree of each node. Subsequently, we fixed the sampling ratio at 1.0 and varied the number of iterations. The outcome reveals that achieving satisfactory performance requires only eight iterations. Furthermore, we adjusted the maximum path distance from 2 to 5, with the optimum reached at 2.

4.6. Computing complexity

Many real-world sparse graphs can be represented with values and indices within O(B) space complexity, where B represents the number of edges in the graph. The computed edge distance matrix R is truncated by C and further simplified through sampling. In the experiment, the spatial complexity of the attention layer is approximately 2 to 3 times that of a first-order attention layer. Taking the Snippets dataset as an example, the GAT model requires about 800 megabytes of GPU memory, whereas our EMGAN model only incurs approximately 300 megabytes. Regarding running time, the time complexity of the shortest-path algorithm we adopted, Dijkstra, is O(V′ B log B), where V′ is the number of nodes in the graph. According to the indices and values of R, the sparse operator (Fey, Lenssen, Weichert, & Müller, 2018) is used to implement the path attention mechanism, making full use of the computing power of the GPU. On the RTX3090 GPU, the runtime per epoch with margin attention on the Snippets dataset is 0.3 s.
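For illustration only, the truncated shortest-path computation and the degree-proportional margin sampling described above could be sketched with SciPy's Dijkstra routine as follows; the function name, the dense distance matrix, and the parameters c_max and mu are assumptions for this example rather than our exact implementation.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def truncated_margins(adj: csr_matrix, c_max: int = 3, mu: float = 1.0, seed: int = 0):
    """Sketch: shortest-path distances truncated at c_max hops, then for each node
    sample a number of high-order neighbors proportional to its degree (mu * degree)."""
    # Dijkstra from every node; distances beyond c_max are returned as inf.
    dist = dijkstra(adj, unweighted=True, limit=c_max)
    rng = np.random.default_rng(seed)
    degrees = np.asarray(adj.sum(axis=1)).ravel().astype(int)
    sampled = {}
    for v in range(adj.shape[0]):
        # candidate margins: nodes reachable within c_max hops, excluding v itself
        cand = np.where((dist[v] <= c_max) & (dist[v] > 0))[0]
        k = min(len(cand), max(1, int(mu * degrees[v])))  # mu = 1.0 -> one margin per edge
        sampled[v] = rng.choice(cand, size=k, replace=False) if k > 0 else np.array([], dtype=int)
    return dist, sampled

# usage sketch on a toy 4-node path graph
rows, cols = [0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]
adj = csr_matrix((np.ones(6), (rows, cols)), shape=(4, 4))
dist, sampled = truncated_margins(adj, c_max=2, mu=1.0)

In practice the truncated distances would be kept in a sparse index/value form (as noted above) so that the path attention can be computed with sparse operators on the GPU.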
5. Conclusion

This paper proposes a novel Edge-Enhanced Minimum-Margin Graph Attention Network (EMGAN) for short text classification. This method optimizes the global topological structure to capture high-order feature information accurately. Specifically, we introduce a novel heterogeneous information graph (HIG) methodology to address the limitations of external knowledge by extracting themes, entities, and keywords as feature extensions. Subsequently, we incorporate an edge enhancement method based on an X-shaped structure, which reconstructs the edge structure between nodes, thus reinforcing edge relationships and obtaining a high-order heterogeneous graph with an X-shaped structure. Furthermore, we devise a Minimum-Margin Graph Attention Network (MMGAN) for short text classification. This model aggregates feature information from high-order neighbors and captures their rich relationships to mitigate the issue of sparse short text features. Extensive experimental results demonstrate the superiority of our model across various short text datasets compared to existing methods. It effectively overcomes the sparsity of short text data and the inadequacy of semantic features, yielding significant improvements in short text classification tasks.

First, we would like to highlight that the Heterogeneous Information Graph (HIG) technology integrates entities, topics, and keywords, enhancing search results in information retrieval and providing deeper insights in network analysis. Edge Enhancement enriches relationships between nodes, benefiting network analysis and financial modeling. MMGAN improves tasks like text summarization and sentiment analysis and enhances personalized recommendations in recommendation systems.

Therefore, EMGAN, combining HIG, Edge Enhancement, and MMGAN, offers a comprehensive understanding of short text content and finds applications in various domains beyond classification, including information retrieval, recommendation systems, social media analysis, and customer feedback. However, although our scheme achieves good results in short text classification, there are still areas for optimization. In future work, we plan to optimize from three perspectives: further enriching the short text HIG, reducing information redundancy, and improving algorithm performance.

CRediT authorship contribution statement

Wei Ai: Supervision, Investigation, Writing – review & editing. Yingying Wei: Conceptualization, Methodology, Data curation, Writing – original draft. Hongen Shao: Supervision, Writing – review & editing. Yuntao Shou: Supervision, Investigation, Writing – review & editing. Tao Meng: Supervision, Investigation, Writing – review & editing. Keqin Li: Supervision, Investigation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

The authors' deepest gratitude goes to the anonymous reviewers and AE for their careful work and thoughtful suggestions that have helped improve this paper substantially. This work is supported by the National Natural Science Foundation of China (Grant No. 69189338), the Excellent Young Scholars of Hunan Province of China (Grant No. 22B0275), and the Changsha Natural Science Foundation, China (Grant No. kq2202294).

References

Ai, W., Wang, Z., Shao, H., Meng, T., & Li, K. (2023). A multi-semantic passing framework for semi-supervised long text classification. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies, 1–17.
Balomenos, T., Raouzaiou, A., Ioannou, S., Drosopoulos, A., Karpouzis, K., & Kollias, S. (2005). Emotion analysis in man-machine interaction systems. Machine Learning for Multimodal Interaction, 3361, 318–328.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.


Blondel, V. D., Guillaume, J. L., Lambiotte, R., et al. (2008). Fast unfolding of community hierarchies in large networks.
Chakraborty, S., & Singh, A. (2022). Active sampling for text classification with subinstance level queries. In Proceedings of the AAAI conference on artificial intelligence: Vol. 36, (6), (pp. 6150–6158).
Chen, Q., Yao, L., & Yang, J. (2016). Short text classification based on LDA topic model. In 2016 international conference on audio, language and image processing (pp. 749–753). IEEE.
Cui, H., Wang, G., Li, Y., & Welsch, R. E. (2022). Self-training method based on GCN for semi-supervised short text classification. Information Sciences, 611, 18–29.
Cui, H., Wang, C., & Yu, Y. (2023). News short text classification based on bert model and fusion model. Highlights in Science, Engineering and Technology, 34, 262–268.
Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29.
Dijkstra, E. W. (2022). A note on two problems in connexion with graphs. In Edsger Wybe Dijkstra: His life, work, and legacy (pp. 287–290).
Fey, M., Lenssen, J. E., Weichert, F., & Müller, H. (2018). Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 869–877).
Flisar, J., & Podgorelec, V. (2020). Improving short text classification using information from DBpedia ontology. Fundamenta Informaticae, 172(3), 261–297.
Graves, A., & Graves, A. (2012). Long short-term memory. Supervised Sequence Labelling with Recurrent Neural Networks, 37–45.
Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261–266.
Hua, J., Sun, D., Hu, Y., Wang, J., Feng, S., & Wang, Z. (2024). Heterogeneous graph-convolution-network-based short-text classification. Applied Sciences, 14(6), 2279.
Jin, L., Sun, Z., & Ma, H. (2022). Short text classification method with dual channel hypergraph convolution networks. In 2022 8th international conference on systems and informatics (pp. 1–6). IEEE.
Joachims, T. (2005). Text categorization with support vector machines: Learning with many relevant features. In Machine learning: ECML-98: 10th European conference on machine learning, Chemnitz, Germany, April 21–23, 1998, proceedings (pp. 137–142). Springer.
Kateb, F., & Kalita, J. (2015). Classifying short text in social media: Twitter as case study. International Journal of Computer Applications, 111(9), 1–12.
Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Li, P., Liu, Y., Hu, Y., Zhang, Y., Hu, X., & Yu, K. (2022). A drift-sensitive distributed LSTM method for short text stream classification. IEEE Transactions on Big Data, 9(1), 341–357.
Linmei, H., Yang, T., Shi, C., Ji, H., & Li, X. (2019). Heterogeneous graph attention networks for semi-supervised short text classification. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (pp. 4821–4830).
Liu, Y., Li, P., & Hu, X. (2022). Combining context-relevant features with multi-stage attention network for short text classification. Computer Speech and Language, 71, Article 101268.
Liu, P., Qiu, X., & Huang, X. (2016). Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.
Lu, S.-H., Chiang, D.-A., Keh, H.-C., & Huang, H.-H. (2010). Chinese text classification by the Naïve Bayes Classifier and the associative classifier with multiple confidence threshold values. Knowledge-Based Systems, 23(6), 598–604.
Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), Article 026113.
Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv preprint cs/0506075.
Pham, P., Nguyen, L. T., Pedrycz, W., & Vo, B. (2023). Deep learning, graph-based text representation and classification: a survey, perspectives and challenges. Artificial Intelligence Review, 56(6), 4893–4927.
Phan, X.-H., Nguyen, L.-M., & Horiguchi, S. (2008). Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on World Wide Web (pp. 91–100).
Ragesh, R., Sellamanickam, S., Iyer, A., Bairi, R., & Lingam, V. (2021). Hetegcn: heterogeneous graph convolutional networks for text classification. In Proceedings of the 14th ACM international conference on web search and data mining (pp. 860–868).
Rousseau, F., Kiagias, E., & Vazirgiannis, M. (2015). Text categorization as a graph classification problem. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers) (pp. 1702–1712).
Sidhu, D., Nair, R., & Abdallah, S. (1991). Finding disjoint paths in networks. In Proceedings of the conference on communications architecture & protocols (pp. 43–51).
Vitale, D., Ferragina, P., & Scaiella, U. (2012). Classification of short texts by deploying topical annotations. In ECIR (pp. 376–387). Springer.
Wang, X., Chen, R., Jia, Y., & Zhou, B. (2013). Short text classification using wikipedia concept based document representation. In 2013 international conference on information technology and applications (pp. 471–474). IEEE.
Wang, C., Jiang, H., Chen, T., Liu, J., Wang, M., Jiang, S., et al. (2022). Entity understanding with hierarchical graph learning for enhanced text classification. Knowledge-Based Systems, 244, Article 108576.
Wang, Z., Liu, X., Yang, P., Liu, S., & Wang, Z. (2021). Cross-lingual text classification with heterogeneous graph neural network. arXiv preprint arXiv:2105.11246.
Wang, C., Song, Y., Li, H., Zhang, M., & Han, J. (2016). Text classification with heterogeneous information network kernels. In Proceedings of the AAAI conference on artificial intelligence: Vol. 30, (1).
Wang, Y., Wang, S., Yao, Q., & Dou, D. (2021). Hierarchical heterogeneous graph representation learning for short text classification. arXiv preprint arXiv:2111.00180.
Wang, Y., Wang, H., Zhang, X., Chaspari, T., Choe, Y., & Lu, M. (2019). An attention-aware bidirectional multi-residual recurrent neural network (abmrnn): A study about better short-term text classification. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (pp. 3582–3586). IEEE.
Wang, J., Wang, Z., Zhang, D., & Yan, J. (2017). Combining knowledge with deep convolutional neural networks for short text classification. In IJCAI: Vol. 350, (pp. 3172077–3172295).
Wu, M. (2023). Commonsense knowledge powered heterogeneous graph attention networks for semi-supervised short text classification. Expert Systems with Applications, 232, Article 120800.
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Philip, S. Y. (2020). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1), 4–24.
Xia, S., Peng, D., Meng, D., Zhang, C., Wang, G., Giem, E., et al. (2020). A fast adaptive k-means with no bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Xia, S., Wang, G., Chen, Z., Duan, Y., et al. (2018). Complete random forest based class noise filtering learning for improving the generalizability of classifiers. IEEE Transactions on Knowledge and Data Engineering, 31(11), 2063–2078.
Yang, T., Hu, L., Shi, C., Ji, H., Li, X., & Nie, L. (2021). HGAT: Heterogeneous graph attention networks for semi-supervised short text classification. ACM Transactions on Information Systems (TOIS), 39(3), 1–29.
Yang, S., Liu, Y., Zhang, Y., & Zhu, J. (2023). A word-concept heterogeneous graph convolutional network for short text classification. Neural Processing Letters, 55(1), 735–750.
Yao, D., Bi, J., Huang, J., & Zhu, J. (2015). A word distributed representation based framework for large-scale short text classification. In 2015 international joint conference on neural networks (pp. 1–7). IEEE.
Yao, L., Mao, C., & Luo, Y. (2019). Graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence: Vol. 33, (01), (pp. 7370–7377).
Ye, Z., Jiang, G., Liu, Y., Li, Z., & Yuan, J. (2020). Document and word representations generated by graph convolutional network and bert for short text classification. In ECAI 2020 (pp. 2275–2281). IOS Press.
Yu, H.-F., Ho, C.-H., Arunachalam, P., Somaiya, M., & Lin, C.-J. (2012). Product title classification versus text classification. Csie. Ntu. Edu. Tw, 1–25.
Zhang, B., He, Q., & Zhang, D. (2022). Heterogeneous graph neural network for short text classification. Applied Sciences, 12(17), 8711.
Zhang, W., Yoshida, T., & Tang, X. (2008). Text classification based on multi-word with support vector machine. Knowledge-Based Systems, 21(8), 879–886.
Zhou, Y., Li, J., Chi, J., Tang, W., & Zheng, Y. (2022). Set-CNN: A text convolutional neural network based on semantic extension for short text classification. Knowledge-Based Systems, 257, Article 109948.
Zhou, Y., Xu, B., Xu, J., Yang, L., & Li, C. (2016). Compositional recurrent neural networks for chinese short text classification. In 2016 IEEE/WIC/ACM international conference on web intelligence (pp. 137–144). IEEE.
