
Multimedia Systems (2023) 29:2363–2373
https://doi.org/10.1007/s00530-023-01112-y

REGULAR PAPER

Local discriminative graph convolutional networks for text classification

Bolin Wang1 · Yuanyuan Sun1 (corresponding author) · Yonghe Chu2 · Changrong Min1 · Zhihao Yang1 · Hongfei Lin1

1 Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, No. 2 Linggong Road, Dalian 116024, Liaoning, China
2 School of Information Science and Engineering, Henan University of Technology, No. 100 Lianhua Street, Zhengzhou 450001, Henan, China

Received: 21 February 2023 / Accepted: 14 May 2023 / Published online: 29 May 2023
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023

Abstract
Recently, graph convolutional networks (GCNs) have demonstrated great success in text classification. However, the GCN focuses only on the fit between the ground-truth labels and the predicted ones; it ignores the local intra-class diversity and local inter-class similarity that are implicitly encoded by the graph, which is an important cue in the machine learning field. In this paper, we propose a local discriminative graph convolutional network (LDGCN) to boost the performance of text classification. Different from the text GCN, which minimizes only the cross-entropy loss, our proposed LDGCN is trained by optimizing a new discriminative objective function, so that in the new LDGCN feature space texts from the same class are mapped closely to each other and texts from different classes are mapped as far apart as possible. This ensures that the features extracted by the GCN have good discriminative ability and achieve the maximum separability of samples. Experimental results demonstrate its superiority over the baselines.

Keywords Text classification · Graph convolutional network · Discriminative information · Manifold structures

1 Introduction

Text classification is a basic task in natural language processing (NLP); its goal is to infer the label of a given text [1–3]. Recently, deep learning has achieved great success in text classification, for example with the convolutional neural network (CNN) [4] and the recurrent neural network (RNN) [5]. Although these methods have made great progress, they can only deal with data in Euclidean space and cannot fully and effectively express the semantic information of the text.

Recently, graph convolutional networks (GCNs) [6] have also demonstrated great success in text classification, especially in analyzing irregular grids with non-Euclidean geometry. Yao et al. [7] applied GCN to corpus-level heterogeneous text graphs and proposed TextGCN. In their framework, they take the entire corpus and build a graph with words and documents as nodes. Vashishth et al. [8] proposed the SynGCN model, a graph CNN that uses syntactic context to learn word embeddings, and a SemGCN framework for integrating various semantic knowledge. Liu et al. [9] proposed a tensor GCN for text classification. Ragesh et al. [10] combined PTE and TextGCN to propose a heterogeneous graph convolutional network (HeteGCN) model. Liu et al. [11] proposed a deep attention-diffusion GNN to learn textual representations, which bridges the gap between words and their distant neighbors.
Although GCN-based methods have demonstrated success in text classification, they still face challenges such as local intra-class diversity and local inter-class similarity, which are significant sources of text classification errors. For this reason, how to learn a more powerful TextGCN feature representation with smaller intra-class dispersion and larger inter-class separation at the same time is an urgent problem to be solved. Existing GNN-based text classifiers only make use of a softmax or cross-entropy objective function to learn an optimal text representation that is most similar to the ground-truth label for each document, while they fail to consider both the intra-class and inter-class manifold structures of samples in a corpus. The intra-class term describes manifold structures of samples in the same class, while the inter-class term describes manifold structures of samples from different classes. Considering these manifold structures is able to further improve performance on various classification tasks. As such, how to make full use of the manifold structures of documents has become a crucial task for text classification.

To overcome the aforementioned challenges, we propose a local discriminative graph convolutional network (LDGCN). Our approach involves constructing the local inter-class scatter matrix and the local intra-class scatter matrix of text data, which are introduced into TextGCN as a discriminative term. In the new LDGCN feature space, texts from the same class are mapped closely together, while texts from different classes are mapped as far apart as possible. In contrast to TextGCN, which solely minimizes the cross-entropy loss, our proposed method can minimize the intra-class distance and maximize the inter-class distance while minimizing the cross-entropy loss function. This makes the LDGCN model more discriminative and improves the effectiveness of TextGCN in text classification. Our two contributions are as follows:

1. To solve the problem that the GCN ignores local intra-class diversity and local inter-class similarity, we propose a novel GCN method; our proposed LDGCN is trained by optimizing a new discriminative objective function.
2. We design a discriminant regularization framework, which uses manifold learning to capture the local discriminative manifold structure of text data.

2 Related work

Deep learning has shown exceptional performance in text classification; commonly used approaches are based on CNN [12], RNN [13], and combinations of various models with other traditional techniques [14, 15]. Among these, the CNN has proven to be particularly effective. Kim et al. [4] proposed the classic Text CNN model, which takes a matrix composed of word vectors as input and obtains a vector representing the sentence through a convolution operation to handle text classification tasks. Building on this foundation, a number of improved Text CNN models have emerged. For instance, Yin et al. [16] introduced a model that incorporates both CNN and attention mechanisms, utilizing Euclidean distance to calculate the attention matrix and then applying it to the CNN pooling operation. Zhang et al. [12] proposed a character-level CNN, which uses characters as the basic unit of input, allowing for greater versatility, and tries to learn text representations from character sequences. Liu et al. [17] used long short-term memory (LSTM) to learn text representations for text classification. Although these methods have proven to be effective, they typically focus on local, contiguous word sequences without explicitly leveraging the global word co-occurrence information present in the corpus. Recently, pre-trained models such as BERT [18] and RoBERTa [19] have shown impressive results in different NLP tasks, even with limited training data. Nevertheless, utilizing these pre-trained models necessitates significant computation and external knowledge resources, which might not always be readily available.

Recently, graph-based neural networks have shown superior ability to capture global information compared to sequential learning models and have been widely applied in NLP tasks [20–23]. Yao et al. [7] utilized the standard GCN [24] in text classification. Cao et al. [25] encode graphs from different perspectives, allowing the model to learn aligned embeddings that enhance its robustness to structural changes. Dai et al. [26] introduced the graph fusion network (GFN), which integrates external knowledge and multiple views of the text graph to capture sufficient structural information. Several studies have focused on combining local and global information to enhance GCN research. Zhu et al. [27] proposed a global and local dependency-guided GCN (GL-GCN). Jin et al. [28] proposed BiTe-GCN, which employs bi-directional convolution of topology and features to merge global and local information in a text-rich network for modeling.

Our method is also based on the GCN framework. However, different from prior work, our approach is primarily concerned with enhancing the GCN feature representations through local intra-class diversity and local inter-class similarity. Our method not only preserves the GCN model's capability to capture global information but also considers the local manifold information of the data. To the best of our knowledge, local manifold information has been extensively applied in the area of image processing [29–31], but there is currently no research on the impact of local intra-class diversity and local inter-class similarity on the performance of GCN for text classification tasks.

3 Method

This section introduces the LDGCN model and presents its overall framework and algorithm.
Fig. 1 The framework of LDGCN

3.1 Overview of the proposed method

Our proposed LDGCN model is illustrated in Fig. 1. We start by extracting text features, and then compute the intra-class scatter and the inter-class scatter of the text features. Finally, we add the scatter as discriminative regular terms to the TextGCN loss function. This approach incorporates discriminative information into the fully connected feature layer extracted by TextGCN and enables joint training of the TextGCN classification loss and the discriminant regularization loss to obtain the final text classification result. This joint method enhances the model's ability to distinguish associations and improves the feature expression of the model. In Fig. 1, the blue dots in the text heterogeneous graph represent documents, while the yellow dots represent words. The solid lines represent the connections between documents and words, and the dotted lines represent the connections between words. We use the adjacency matrix calculated from the heterogeneous graph as the TextGCN model input.

3.2 Text graph convolutional networks

To construct a topological graph from a text corpus, we represent its nodes as a combination of documents and words. The graph consists of |voc| + |doc| nodes, where |doc| is the number of documents and |voc| is the size of the vocabulary. The weights of edges between document and word nodes are based on the term frequency-inverse document frequency (TF-IDF), while the connections between word nodes are determined by global word co-occurrence information. We obtain this information by sliding a fixed-size window across the corpus and calculating the point-wise mutual information (PMI) between pairs of words to assign connection weights. The PMI is calculated as follows:

\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)} \quad (1)

p(i, j) = \frac{N(i, j)}{N} \quad (2)

p(i) = \frac{N(i)}{N} \quad (3)

where N is the total number of sliding windows, N(i, j) is the number of sliding windows containing both nodes i and j, N(i) is the number of sliding windows containing node i, p(i, j) is the probability that a window contains both nodes i and j, and p(i) is the probability that a window contains node i.
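To make the window statistics behind Eqs. (1)-(3) concrete, the sketch below counts sliding windows over a tokenized corpus and returns the PMI of every co-occurring word pair; the whitespace tokenizer and the window_size default are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged sketch of Eqs. (1)-(3); tokenization and window handling are assumptions.
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs, window_size=20):
    """Slide a fixed-size window over each document and compute PMI(i, j)."""
    single = Counter()   # N(i): number of windows containing word i
    pair = Counter()     # N(i, j): number of windows containing both i and j
    n_windows = 0        # N: total number of windows
    for doc in docs:
        tokens = doc.split()
        for start in range(max(1, len(tokens) - window_size + 1)):
            window = set(tokens[start:start + window_size])
            n_windows += 1
            single.update(window)
            pair.update(combinations(sorted(window), 2))
    pmi = {}
    for (i, j), n_ij in pair.items():
        p_ij = n_ij / n_windows                                   # Eq. (2)
        p_i, p_j = single[i] / n_windows, single[j] / n_windows   # Eq. (3)
        pmi[(i, j)] = math.log(p_ij / (p_i * p_j))                # Eq. (1)
    return pmi
```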
Thus, the weight A_ij of the edge between nodes i and j is obtained, defined as follows:

A_{ij} =
\begin{cases}
\mathrm{PMI}(i, j) & i, j \text{ are words and } \mathrm{PMI}(i, j) > 0 \\
\mathrm{TF\text{-}IDF}_{ij} & i \text{ is a document, } j \text{ is a word} \\
1 & i = j \\
0 & \text{otherwise}
\end{cases}
\quad (4)

When PMI(i, j) is positive, word i and word j have a strong semantic correlation; when PMI(i, j) is negative, they have little or no semantic correlation. When building the heterogeneous graph, we therefore only add edges between word pairs with positive PMI values, and then feed the weighted graph into a simple two-layer GCN for learning. Word and document embeddings are obtained in the second layer of the GCN, and the dimension of the embedding is the same as the number of label categories. The extracted feature Z is calculated by formula (5), and finally the node embeddings are sent to the softmax function to obtain a temporary classification output Y, as shown in formula (6):

Z = \hat{A}\, \mathrm{ReLU}(\hat{A} X W_0)\, W_1 \quad (5)

Y = \mathrm{softmax}(Z) \quad (6)

where \hat{A} = \hat{D}^{-1/2} A \hat{D}^{-1/2} is the normalized symmetric adjacency matrix, \hat{D} is the degree matrix corresponding to A with \hat{D}_{ii} = \sum_j A_{ij}, X is the feature matrix of the n nodes, W_0 and W_1 are the trainable weight matrices of the first and second layers respectively, and ReLU is the activation function between the layers.
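As a concrete reading of Eqs. (4)-(6), the following NumPy sketch assembles the heterogeneous adjacency matrix from PMI and TF-IDF weights and runs the two-layer propagation; the document-first index layout and the pmi and tfidf input dictionaries (keyed by integer indices) are assumptions made for illustration, not the authors' code.

```python
# Minimal NumPy sketch of Eqs. (4)-(6).
import numpy as np

def build_adjacency(n_docs, n_words, pmi, tfidf):
    """Eq. (4): word-word edges from positive PMI, doc-word edges from TF-IDF."""
    n = n_docs + n_words
    A = np.eye(n)                                   # A_ij = 1 when i == j
    for (wi, wj), score in pmi.items():             # word indices in [0, n_words)
        if score > 0:
            A[n_docs + wi, n_docs + wj] = score
            A[n_docs + wj, n_docs + wi] = score
    for (doc, word), weight in tfidf.items():       # doc in [0, n_docs)
        A[doc, n_docs + word] = weight
        A[n_docs + word, doc] = weight
    return A

def gcn_forward(A, X, W0, W1):
    """Eqs. (5)-(6): Z = A_hat ReLU(A_hat X W0) W1, Y = softmax(Z)."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt             # symmetric normalization
    Z = A_hat @ np.maximum(A_hat @ X @ W0, 0) @ W1  # ReLU between the two layers
    expZ = np.exp(Z - Z.max(axis=1, keepdims=True))
    Y = expZ / expZ.sum(axis=1, keepdims=True)      # row-wise softmax
    return Z, Y
```

In practice A would be stored as a sparse matrix; the dense arrays here only illustrate the formulas.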
3.3 Local discriminative graph convolutional networks

Fig. 2 Conceptual illustration of the discriminative regular term. Discriminative learning goes the way a-b to learn the metric/mapping from the original semantic space a to a new, more discriminant manifold space b, which reflects the metric calculation process of Eqs. 8–14

Figure 2 illustrates the concept of discriminative regularization. Texts belonging to the same class are grouped closely together, while those from different classes are separated as much as possible. Inspired by manifold learning [32], we add a discriminative learning term to the GCN features, so that the GCN enhances the discriminative power of the model in addition to minimizing the cross-entropy loss:

J = \min \big( J_1(X_k, W_0, W_1) + \lambda_1 J_2(X_k, W_0, W_1) \big) \quad (7)

where X_k is the set of training samples in X = \{x_i \mid i = 1, \dots, N\}, J_1 is the cross-entropy loss function, J_2 is the discriminant regularization function, and \lambda_1 is the weight coefficient used to balance the two cost functions.

(1) Cross-entropy loss term: in our study, we define the cross-entropy loss function as

J_1 = -\sum_{i \in k} \sum_{j=1}^{C} T_{ij} \ln Y_{ij} \quad (8)

where Y = softmax(Z) is the final output, T_{ij} is the ground-truth label indicator of node i for class j, C denotes the number of node categories, and k is the set of labelled nodes.

(2) Discriminative learning regularization term: the second term J_2(X_k, W_0, W_1) in Eq. (7) is a discrimination regularization constraint imposed on the GCN features. The intra-class scatter S_w and the inter-class scatter S_b are expressed as:

S_w = \sum_{k=1}^{C} \sum_{i=1}^{N_k} \sum_{j=1}^{N_k} w^{k}_{ij} \big( H^{(L)}(x_i^{(k)}) - H^{(L)}(x_j^{(k)}) \big) \big( H^{(L)}(x_i^{(k)}) - H^{(L)}(x_j^{(k)}) \big)^{T} \quad (9)

S_b = \sum_{k=1}^{C} N_k \, \mathrm{similar} \, (\mu_k - \mu)(\mu_k - \mu)^{T} \quad (10)

Here H^{(L)}(x_i) is the LDGCN feature layer. There are C categories in the data set, N = N_1 + N_2 + \dots + N_C is the total number of samples, N_k is the number of samples of the k-th category, x_i^{(k)} is the i-th sample of the k-th category, \mu_k is the mean vector of the samples of the k-th class, and \mu is the mean vector of all samples. \mu_k and \mu are calculated by:

\mu_k = \frac{1}{N_k} \sum_{i=1}^{N_k} H^{(L)}(x_i^{(k)}) \quad (11)

\mu = \frac{1}{N} \sum_{k=1}^{C} \sum_{i=1}^{N_k} H^{(L)}(x_i^{(k)}) \quad (12)

In Eq. (9), we introduce weight coefficients w^{k}_{ij} in the calculation of the intra-class scatter matrix S_w to assign different weights to sample pairs, distinguishing the similarity and difference between samples of the same class by the magnitude of the weights and thus enabling the intra-class scatter matrix to extract more local structural information:

w^{k}_{ij} =
\begin{cases}
e^{-\frac{\| H^{(L)}(x_i^{(k)}) - H^{(L)}(x_j^{(k)}) \|^2}{\sigma}} & x_i \in k,\; x_j \in k \\
0 & x_i \in k,\; x_j \notin k
\end{cases}
\quad (13)

When H^{(L)}(x_i^{(k)}) and H^{(L)}(x_j^{(k)}) are closer in the discriminant layer space, the weight coefficient w^{k}_{ij} is larger; in other words, the similarity between samples x_i^{(k)} and x_j^{(k)} is greater, and conversely it is smaller. This approach differentiates the similarities and differences between sample pairs within the same category through different weights.

In Eq. (10), we use the similar function to express the similarity of the centre \mu_k of the k-th category to the centre \mu of the overall sample. The closer \mu_k is to \mu, the larger the function value, and conversely the smaller it is; the similar function takes values in the range [0, 1]. We use it to adjust the weights of the inter-class distances and redefine the inter-class scatter matrix, so that classes whose means are close to each other can be better separated, alleviating overlap or crossover between classes:

\mathrm{similar} = e^{-\frac{\| \mu_k - \mu \|^2}{t}} \quad (14)

We thus obtain a concrete representation of the discriminant regularity term:

J_2 = \mathrm{tr}(S_w - \lambda S_b) \quad (15)

where \lambda is the ratio coefficient for adjusting S_w and S_b, and tr is the trace of a matrix. The loss function of the proposed LDGCN is therefore

J = -\sum_{i \in k} \sum_{j=1}^{C} T_{ij} \ln Y_{ij} + \lambda_1 \, \mathrm{tr}(S_w - \lambda S_b) \quad (16)
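To illustrate how the scatter terms of Eqs. (9)-(15) and the joint loss of Eq. (16) fit together, here is a hedged PyTorch sketch of the discriminative regularizer; the hyperparameter names sigma, t, lam (λ) and lam1 (λ1), and the assumption that H holds the feature-layer outputs of the labelled nodes, are placeholders for illustration rather than the authors' released code.

```python
# Hedged PyTorch sketch of the discriminative term in Eqs. (9)-(16).
import torch

def discriminative_regularizer(H, labels, sigma=1.0, t=1.0, lam=1.0):
    """H: features H^(L) of the labelled nodes, labels: their class ids."""
    d = H.size(1)
    Sw = H.new_zeros(d, d)
    Sb = H.new_zeros(d, d)
    mu = H.mean(dim=0)                                    # Eq. (12)
    for k in labels.unique():
        Hk = H[labels == k]                               # samples of class k
        mu_k = Hk.mean(dim=0)                             # Eq. (11)
        diff = Hk.unsqueeze(1) - Hk.unsqueeze(0)          # pairwise H(x_i) - H(x_j)
        w = torch.exp(-diff.pow(2).sum(-1) / sigma)       # Eq. (13), within-class weights
        Sw = Sw + torch.einsum('ij,ijd,ije->de', w, diff, diff)          # Eq. (9)
        similar = torch.exp(-(mu_k - mu).pow(2).sum() / t)               # Eq. (14)
        Sb = Sb + Hk.size(0) * similar * torch.outer(mu_k - mu, mu_k - mu)  # Eq. (10)
    return torch.trace(Sw - lam * Sb)                     # Eq. (15)

# Total loss, Eq. (16), with F = torch.nn.functional and Z the GCN output:
# loss = F.cross_entropy(Z[train_idx], y) + lam1 * discriminative_regularizer(H[train_idx], y)
```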

4 Experiments

In this section, we design multiple sets of comparative experiments on several text classification datasets. We first introduce the experimental datasets and the comparison models. Then, we conduct classification experiments on multiple public text classification datasets, compare the classification accuracy of each model, and analyze the results. Finally, we further analyze our model from the perspectives of varying the proportion of training data and of document embedding visualization.

4.1 Datasets

We evaluate the effectiveness of our model on five benchmark text classification datasets: Ohsumed,1 R8 and R52,2 Movie Review (MR)3 and 20-Newsgroups (20NG).4 Table 1 shows the dataset description. Ohsumed contains abstracts on cardiovascular diseases and is generated from the MEDLINE database. R8 and R52 are subsets of the Reuters dataset designed for topic classification. 20NG is a news classification dataset that consists of 20 different categories. MR is a sentiment polarity analysis dataset of movie reviews, divided into positive and negative reviews. To begin with, we preprocessed all datasets by performing text cleaning and tokenization, following the approach described in [7]. Next, we eliminated stop words using the NLTK5 library and removed words that appeared fewer than 5 times in the 20NG, R8, R52, and Ohsumed datasets. However, we did not remove any words from the MR dataset after cleaning and tokenizing the raw text, since the documents in this dataset are extremely short.

Table 1 Dataset description

Dataset   # Docs   # Train   # Test   # Words   # Nodes   # Classes   Avg. length
Ohsumed   7400     3357      4043     14,157    21,557    23          135.82
R8        7674     5485      2189     7688      15,362    8           65.72
R52       9100     6532      2568     8892      17,992    52          69.82
MR        10,662   7108      3554     18,764    29,426    2           20.39
20NG      18,846   11,314    7532     42,757    61,603    20          221.26

1 http://disi.unitn.it/moschitti/corpora.htm
2 https://www.cs.umb.edu/smimarog/textmining/datasets/
3 http://www.cs.cornell.edu/people/pabo/movie-review-data/
4 http://qwone.com/jason/20Newsgroups/
5 http://www.nltk.org/
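A minimal approximation of the preprocessing described in Sect. 4.1, assuming an NLTK stop-word list (downloaded beforehand via nltk.download('stopwords')) and a simple regular-expression tokenizer, could look as follows; the exact cleaning rules and the min_count threshold are assumptions.

```python
# Illustrative preprocessing: lower-case, drop NLTK stop words, remove words
# appearing fewer than 5 times (the rare-word filter is skipped for MR).
import re
from collections import Counter
from nltk.corpus import stopwords

def preprocess(docs, min_count=5, remove_rare=True):
    stops = set(stopwords.words('english'))
    tokenized = [[w for w in re.findall(r"[a-z0-9']+", d.lower()) if w not in stops]
                 for d in docs]
    if remove_rare:
        freq = Counter(w for doc in tokenized for w in doc)
        tokenized = [[w for w in doc if freq[w] >= min_count] for doc in tokenized]
    return tokenized
```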
Table 2 Test accuracy (%) for different models on five datasets

Model             20NG           R8             R52            Ohsumed        MR
TF-IDF+LR         83.19 ± 0.00   93.74 ± 0.00   86.95 ± 0.00   54.66 ± 0.00   74.59 ± 0.00
CNN-rand          76.93 ± 0.61   94.02 ± 0.57   85.37 ± 0.47   43.87 ± 1.00   74.98 ± 0.70
CNN-non-static    82.15 ± 0.52   95.71 ± 0.52   87.59 ± 0.48   58.44 ± 1.06   77.75 ± 0.72
LSTM              65.71 ± 1.52   93.68 ± 0.82   85.54 ± 1.13   41.13 ± 1.17   75.06 ± 0.44
LSTM-pretrained   75.43 ± 1.72   96.09 ± 0.19   90.48 ± 0.86   51.10 ± 1.50   77.33 ± 0.89
Bi-LSTM           73.18 ± 1.85   96.31 ± 0.33   90.54 ± 0.91   49.27 ± 1.07   77.68 ± 0.86
PV-DBOW           74.36 ± 0.18   85.87 ± 0.10   78.29 ± 0.11   46.65 ± 0.19   61.09 ± 0.10
PV-DM             51.14 ± 0.22   52.07 ± 0.04   44.92 ± 0.05   29.50 ± 0.07   59.47 ± 0.38
Fasttext          79.38 ± 0.30   96.13 ± 0.21   92.81 ± 0.09   57.70 ± 0.49   75.14 ± 0.20
Fasttext-bigram   79.67 ± 0.29   94.74 ± 0.11   90.99 ± 0.05   55.69 ± 0.39   76.24 ± 0.12
PTE               76.74 ± 0.29   96.69 ± 0.13   90.71 ± 0.14   53.58 ± 0.29   70.23 ± 0.36
SWEM              85.16 ± 0.29   95.32 ± 0.26   92.94 ± 0.24   63.12 ± 0.55   76.65 ± 0.63
LEAM              81.91 ± 0.24   93.31 ± 0.24   91.84 ± 0.23   58.58 ± 0.79   76.95 ± 0.45
TextGCN           86.34 ± 0.09   97.07 ± 0.10   93.56 ± 0.18   68.36 ± 0.56   76.74 ± 0.20
TensorGCN         87.74 ± 0.05   98.04 ± 0.08   95.05 ± 0.11   70.11 ± 0.24   77.91 ± 0.07
TLGCN             0.8592 ± 0.03  0.9769 ± 0.01  0.9441 ± 0.06  0.6914 ± 0.04  0.7638 ± 0.04
HINT              –              98.12 ± 0.09   95.02 ± 0.18   68.79 ± 0.12   77.03 ± 0.12
CGA2TC            –              97.76 ± 0.19   94.47 ± 0.16   70.62 ± 0.45   77.80 ± 0.29
GFN               87.01          98.22          95.31          70.20          78.04
LDGCN             87.79 ± 0.03   98.32 ± 0.05   95.71 ± 0.14   70.85 ± 0.18   78.25 ± 0.11

Bold represents the best result for each column. LDGCN is run 10 times and we report mean ± standard deviation.

4.2 Baselines

To compare the classification performance of our model with existing text classification models, we choose the following methods for our experiments:

• Traditional machine learning-based method: the TF-IDF + LR bag-of-words model, which employs logistic regression as its classifier with term frequency-inverse document frequency weighting.
• Word embedding-based models: PV-DM and PV-DBOW [33], FastText [34], PTE [35], SWEM [36], LEAM [37].
• Sequence deep learning models: we explored several models, including CNN-rand [4], which uses randomly initialized word embeddings, and CNN-non-static, which uses pre-trained word embeddings. We also evaluated the LSTM [38] model, which represents the whole text by its last hidden state, and a bi-directional LSTM with pre-trained word embeddings, a commonly used approach in text classification.
• GNN-based methods: TextGCN [7]; the TLGCN [39] model, which builds text-level graphs instead of a single graph for the entire corpus; the GFN [26] model, a Transformer-based heterogeneous GNN designed for large corpora that ignores the text graph's heterogeneity; HINT [40], which minimizes the structural entropy to construct the coding tree of the graph and thereby effectively utilizes the hierarchical information within the text; and CGA2TC [22], which introduces contrastive learning and adaptive augmentation strategies to obtain more powerful node representations.

4.3 Experiment settings

In the experiments we use two LDGCN layers; the embedding size of the first convolution layer is 200 and the window size is 20. To prevent overfitting, a dropout layer is added after the word embedding layer and after the convolutional layer, respectively; the dropout rate is set to 0.5 and the learning rate to 0.02. We randomly select 10% of the training set as the validation set and train the model using Adam [41] for a maximum of 200 epochs. For the baseline models, we use the default parameter settings from the corresponding papers. For pretrained word embedding models, we use the 300-dimensional GloVe vectors [42]. Table 2 shows the results of the different methods.
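As an illustration of these settings, a minimal training loop might look like the sketch below; the model, loss_fn and val_metric callables and the early-stopping details are placeholders rather than the authors' released code.

```python
# Hedged sketch of the optimization setup in Sect. 4.3 (Adam, lr 0.02, 200 epochs).
import torch

def train(model, loss_fn, val_metric, epochs=200, lr=0.02):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val = 0.0
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model)        # Eq. (16): cross-entropy + lam1 * tr(Sw - lam * Sb)
        loss.backward()
        optimizer.step()
        model.eval()
        with torch.no_grad():
            best_val = max(best_val, val_metric(model))  # monitor the 10% validation split
    return best_val
```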
5 Results analysis

In Table 2, the LDGCN model shows significant improvements over all baseline models across the five datasets. Notably, when using pre-trained word embeddings, the CNN obtains the best results on the MR dataset, which suggests that it excels at modeling short-range semantics and continuous data.

We observed that TextGCN performs significantly worse than the sequential models on the MR dataset. PTE and FastText exhibit superior performance over PV-DBOW as they adopt supervised learning approaches to generate document embeddings, allowing for more discriminative embeddings through the use of label information. Two recently introduced methods, SWEM and LEAM, show notable results, highlighting the effectiveness of pooling methods and of label descriptions and embeddings.

It is worth noting that GNN-based models tend to outperform sequential and bag-of-words models, primarily due to their ability to capture global word co-occurrence in the corpus. Compared with TextGCN and the other variants, LDGCN shows significant accuracy improvements on the five datasets. Specifically, on the R52, Ohsumed and MR datasets, the results obtained by LDGCN improve by about 2%, while the improvement on both the 20NG and R8 datasets is nearly 1%. Such enhancements demonstrate that considering discriminative power is effective for long text classification. LDGCN incorporates the concept of local linear discriminant analysis to develop a novel loss function for training the text graph convolutional network. This loss function aims to minimize the feature distance within a class while simultaneously maximizing the feature distance between classes, thereby enhancing the GCN's discriminative feature capabilities.

5.1 Effects of the size of the sliding window

In this subsection, we investigate how the size of the sliding window used to build the text graph structure affects the final classification performance. We adjusted the sliding window parameter on the five datasets, and the experimental results are shown in Fig. 3; the accuracy of the proposed method is better than that of TextGCN in all cases. We report test accuracies with sliding window sizes of 5, 10, 15, 20, 25 and 30, and we note that our LDGCN outperforms TextGCN consistently. For instance, LDGCN achieves a test accuracy of 0.9812 on R8 with a sliding window size of 30, which is higher than TextGCN's accuracy of 0.9671.

On the 20NG dataset, both our proposed model and TextGCN achieved their highest test accuracy with a sliding window size of 30, while the lowest accuracy was observed with a window size of 5. Notably, when the window size was set to 20, both models yielded a test accuracy of 0.8624. Based on these results, we selected a window size of 20 for our experiments on the 20NG dataset, as this value can better reflect the effectiveness of our proposed model. For the R8 dataset, our proposed model achieved the highest test accuracy with a window size of 5, while the lowest accuracy was observed with a window size of 25; TextGCN achieved its highest test accuracy with a window size of 30 and its lowest with a window size of 15. To obtain an average performance across all window sizes, we conducted experiments with a sliding window size of 20. On the R52 dataset, both our proposed model and TextGCN achieved the highest test accuracy with a sliding window size of 30, so we chose a sliding window size of 30 for our experiments on this dataset. Regarding the Ohsumed dataset, our proposed model and TextGCN achieved the best performance with a sliding window size of 25, while both models performed worst with a sliding window size of 20; hence, we chose a sliding window size of 25 for our experiments. On the MR dataset, both our proposed model and TextGCN achieved the best test accuracy with a sliding window size of 15, so we selected a sliding window size of 15 for our experiments on this dataset. According to these results, a suitable sliding window size can be selected for each dataset to obtain the best classification performance.

5.2 Effects of the proportion of labeled data

We choose TextGCN and LDGCN to study the impact of the number of labelled documents. We vary the ratio of labelled documents and compare their performance on the five datasets. Figure 4 reports test accuracies with 1%, 5%, 10%, and 20% of the training data. We note that our LDGCN outperforms TextGCN consistently. For instance, LDGCN achieves a test accuracy of 0.7715 on 20NG with only 10% of the training documents, which is higher than TextGCN with even 20% of the training documents, and a test accuracy of 0.9132 on R8 with only 1% of the training documents. We also observe that the results improve as the proportion of labeled data increases, while our method fluctuates less than TextGCN. This shows that our approach is better equipped to make the most of the limited amount of labeled data available for text classification, which again demonstrates that our method can capture the local discriminative manifold structure of text data and extract richer latent semantic information.
Fig. 3 Test accuracy with different sliding window sizes

5.3 Ablation study

Our proposed LDGCN utilizes a graph-based local information preservation method to incorporate both intra-class and inter-class discriminant information into the TextGCN model. This enables the model to simultaneously maximize the inter-class discriminant distance and minimize the intra-class discriminant distance while minimizing the cross-entropy loss, ultimately leading to improved classification ability. To evaluate the effectiveness of our approach, we conducted ablation studies on the five datasets and report the Recall and macro-F1 values. The experimental settings are consistent with those outlined in Sect. 4.3, and the results are presented in Table 3. Our findings demonstrate that preserving discriminative information based on graph local information is critical to enhancing the performance of text classification.
Fig. 4 Test accuracy by varying training data proportions

Table 3 Recall and macro-F1 results for each dataset on TextGCN and LDGCN

Evaluation standard   Model     20NG    R8      R52     Ohsumed   MR
Macro-F1              TextGCN   85.57   92.43   65.20   59.09     76.75
                      LDGCN     87.75   97.33   92.69   59.10     78.22
Recall                TextGCN   85.98   92.90   73.80   69.26     76.84
                      LDGCN     87.97   97.62   95.73   70.94     78.25

Bold represents the best result for each column.

6 Conclusions

In this paper, we propose an effective method for learning discriminative graph convolutional networks for text classification. Our method effectively addresses the challenges of intra-class variability and inter-class similarity that arise in text classification. Specifically, our method, known as LDGCN, fully considers the manifold structure of the data, ensuring that the extracted deep features have the smallest intra-class distance and the largest inter-class distance. By combining this with the classification softmax loss, LDGCN can not only learn representative text embeddings and optimize the classification error, but also leverage the underlying intra-class and inter-class manifold structures to enhance discriminative power.
Acknowledgements This work is supported by the National Key Research and Development Program of China (No. 2022YFC3301801) and the Fundamental Research Funds for the Central Universities (No. DUT22ZD205).

References

1. Phan, H.T., Nguyen, N.T., Hwang, D.: Aspect-level sentiment analysis: a survey of graph convolutional network methods. Inf. Fus. 91, 149–172 (2023)
2. Parlak, B., Uysal, A.K.: A novel filter feature selection method for text classification: extensive feature selector. J. Inf. Sci. 49(1), 59–78 (2023)
3. Rao, S., Verma, A.K., Bhatia, T.: A review on social spam detection: challenges, open issues, and future directions. Expert Syst. Appl. 186, 115742 (2021)
4. Chen, Y.: Convolutional neural network for sentence classification. Master's thesis, University of Waterloo (2015)
5. Zhou, P., Qi, Z., Zheng, S., Xu, J., Bao, H., Xu, B.: Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639 (2016)
6. Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song, Y., Yang, Q.: Large-scale hierarchical text classification with recursively regularized deep graph-CNN. In: Proceedings of the 2018 World Wide Web Conference, pp. 1063–1072 (2018)
7. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7370–7377 (2019)
8. Vashishth, S., Bhandari, M., Yadav, P., Rai, P., Bhattacharyya, C., Talukdar, P.: Incorporating syntactic and semantic information in word embeddings using graph convolutional networks. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3308–3318 (2019)
9. Liu, X., You, X., Zhang, X., Wu, J., Lv, P.: Tensor graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 8409–8416 (2020)
10. Ragesh, R., Sellamanickam, S., Iyer, A., Bairi, R., Lingam, V.: HeteGCN: heterogeneous graph convolutional networks for text classification. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 860–868 (2021)
11. Liu, Y., Guan, R., Giunchiglia, F., Liang, Y., Feng, X.: Deep attention diffusion graph neural networks for text classification. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8142–8152 (2021)
12. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 28 (2015)
13. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1556–1566 (2015)
14. Campos Camunez, V., Jou, B., Giró Nieto, X., Torres Viñals, J., Chang, S.-F.: Skip RNN: learning to skip state updates in recurrent neural networks. In: Sixth International Conference on Learning Representations, pp. 1–17 (2018)
15. Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock, M., Hasegawa-Johnson, M.A., Huang, T.S.: Dilated recurrent neural networks. Adv. Neural Inf. Process. Syst. 30 (2017)
16. Yin, W., Schütze, H., Xiang, B., Zhou, B.: ABCNN: attention-based convolutional neural network for modeling sentence pairs. Trans. Assoc. Comput. Linguist. 4, 259–272 (2016)
17. Liu, P., Qiu, X., Huang, X.: Recurrent neural network for text classification with multi-task learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2873–2879 (2016)
18. Kenton, J.D.M.-W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
19. Oh, S.H., Kang, M., Lee, Y.: Protected health information recognition by fine-tuning a pre-training transformer model. Healthc. Inform. Res. 28(1), 16–24 (2022)
20. Wu, L., Chen, Y., Shen, K., Guo, X., Gao, H., Li, S., Pei, J., Long, B.: Graph neural networks for natural language processing: a survey. Found. Trends Mach. Learn. 16(2), 119–328 (2023)
21. Wu, J., Zhang, C., Liu, Z., Zhang, E., Wilson, S., Zhang, C.: GraphBERT: bridging graph and text for malicious behavior detection on social media. In: 2022 IEEE International Conference on Data Mining (ICDM), pp. 548–557 (2022)
22. Yang, Y., Miao, R., Wang, Y., Wang, X.: Contrastive graph convolutional networks with adaptive augmentation for text classification. Inf. Process. Manag. 59(4), 102946 (2022)
23. Krishnaveni, P., Balasundaram, S.: Generating fuzzy graph based multi-document summary of text based learning materials. Expert Syst. Appl. 214, 119165 (2023)
24. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations
25. Cao, Y., Liu, Z., Li, C., Li, J., Chua, T.-S.: Multi-channel graph neural network for entity alignment. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1452–1461 (2019)
26. Dai, Y., Shou, L., Gong, M., Xia, X., Kang, Z., Xu, Z., Jiang, D.: Graph fusion network for text classification. Knowl. Based Syst. 236, 107659 (2022)
27. Zhu, X., Zhu, L., Guo, J., Liang, S., Dietze, S.: GL-GCN: global and local dependency guided graph convolutional networks for aspect-based sentiment classification. Expert Syst. Appl. 186, 115712 (2021)
28. Jin, D., Song, X., Yu, Z., Liu, Z., Zhang, H., Cheng, Z., Han, J.: BiTe-GCN: a new GCN architecture via bidirectional convolution of topology and features on text-rich networks. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 157–165 (2021)
29. Jin, T., Cao, L., Zhang, B., Sun, X., Deng, C., Ji, R.: Hypergraph induced convolutional manifold networks. In: IJCAI, pp. 2670–2676 (2019)
30. Deng, Y., Yang, J., Xiang, J., Tong, X.: GRAM: generative radiance manifolds for 3D-aware image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10673–10683 (2022)
31. Vepakomma, P., Balla, J., Raskar, R.: PrivateMail: supervised manifold learning of deep features with privacy for image retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 8503–8511 (2022)
32. Sugiyama, M.: Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. J. Mach. Learn. Res. 8(5) (2007)
33. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
34. Joulin, A., Grave, É., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431 (2017)
35. Tang, J., Qu, M., Mei, Q.: PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1165–1174 (2015)
36. Shen, D., Wang, G., Wang, W., Min, M.R., Su, Q., Zhang, Y., Li, C., Henao, R., Carin, L.: Baseline needs more love: on simple word-embedding-based models and associated pooling mechanisms. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 440–450 (2018)
37. Wang, G., Li, C., Wang, W., Zhang, Y., Shen, D., Zhang, X., Henao, R., Carin, L.: Joint embedding of words and labels for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2321–2331 (2018)
38. Liu, T., Zhang, X., Zhou, W., Jia, W.: Neural relation extraction via inner-sentence noise reduction and transfer learning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2195–2204 (2018)
39. Huang, L., Ma, D., Li, S., Zhang, X., Wang, H.: Text level graph neural network for text classification. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3444–3450 (2019)
40. Zhang, C., Zhu, H., Peng, X., Wu, J., Xu, K.: Hierarchical information matters: text classification via tree based graph neural network. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 950–959 (2022)
41. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
42. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.