
Dynamic Embedding Projection-Gated Convolutional Neural Networks for Text Classification

Zhipeng Tan, Jing Chen, Qi Kang, Senior Member, IEEE, MengChu Zhou, Fellow, IEEE, Abdullah Abusorrah, Senior Member, IEEE, and Khaled Sedraoui

Abstract— Text classification is a fundamental and important area of natural language processing for assigning a text to at least one predefined tag or category according to its content. Most of the advanced systems are either too simple to get high accuracy or centered on using complex structures to capture the genuinely required category information, which requires a long time to converge during their training stage. In order to address such challenging issues, we propose a dynamic embedding projection-gated convolutional neural network (DEP-CNN) for multi-class and multi-label text classification. Its dynamic embedding projection gate (DEPG) transforms and carries word information by using gating units and shortcut connections to control how much context information is incorporated into each specific position of a word-embedding matrix in a text. To our knowledge, we are the first to apply DEPG over a word-embedding matrix. The experimental results on four known benchmark datasets show that DEP-CNN outperforms its recent peers.

Index Terms— Convolutional neural network (CNN), dynamic embedding projection gate, multi-class and multi-label text classification, natural language processing (NLP).

Manuscript received October 24, 2019; revised April 22, 2020; accepted November 2, 2020. Date of publication January 8, 2021; date of current version March 1, 2022. This work was supported in part by the Natural Science Foundation of China under Grant 51775385 and Grant 61703279, in part by the Strategy Research Project of Artificial Intelligence Algorithms of Ministry of Education of China, in part by the Fundamental Research Funds for the Central Universities, and in part by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant RG-48-135-40. (Corresponding author: Qi Kang.)
Zhipeng Tan, Jing Chen, and Qi Kang are with the Department of Control Science and Engineering, College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China, and also with the Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 200092, China.
MengChu Zhou is with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA, and also with the Center of Research Excellence in Renewable Energy and Power Systems, King Abdulaziz University, Jeddah 21481, Saudi Arabia.
Abdullah Abusorrah and Khaled Sedraoui are with the Department of Electrical and Computer Engineering, King Abdulaziz University, Jeddah 21481, Saudi Arabia, and also with the Center of Research Excellence in Renewable Energy and Power Systems, King Abdulaziz University, Jeddah 21481, Saudi Arabia.
Digital Object Identifier 10.1109/TNNLS.2020.3036192

I. INTRODUCTION

TEXT classification is highly useful [1] in natural language processing (NLP) and plays a vital role in various applications, for instance, topic categorization [2], intent detection [3], sentiment analysis [4], and spam filtering [5]. Unstructured text-based data are one of the forms of information appearing in social media, e-mail, and web pages. It often takes much time to extract useful information from such text because of its unstructured nature [6]. Doing so has been prohibitively expensive and challenging because it takes lots of resources and time to create the needed hand-crafted rules. Recently, a good choice for constructing text data is to use NLP instead of traditional feature engineering, which makes text classification scalable, cost-effective, and rapid.

In earlier years, some machine learning classification approaches, such as the support vector machine (SVM) [7] and naive Bayes [8], mainly extracted the required information from a given text over preprocessed statistical indicators of bag of words, such as term frequency-inverse document frequency weighting [9]. Yet, their weighting schemes ignore the word position order presented in a text. To solve this problem, Silva et al. [10] used the SVM as a classifier to construct an optimal segmentation of an eigenspace [11] based on the extraction of n-gram features, thus achieving good results. It implies that understanding the meaning of each word or n-gram in a text is a necessary step in capturing more sophisticated semantics. Although being simple and effective, it suffers from data sparsity problems [12]. Such traditional word representation forms cannot allow one to obtain the syntactic and semantic information of a given text.

Recently, neural networks have provided an effective way of modeling a text and have become increasingly popular in text categorization while promoting the development of distributed representations of words at the same time. Word2vec [13], GloVe [14], and fastText [15] are the related products, which can map words to low-dimensional spaces with no need for feature filtering, and have received great attention as the input of text classification algorithms. Many approaches construct a variety of architectures for text classification by taking advantage of distributed representations of words.

One of the dominant neural networks in recent years is the long short-term memory (LSTM) network [16]–[19]. It can capture the current information and memorize the previous data points in a sequence. A Bi-LSTM architecture with appropriate regularization can yield high accuracy and F1 score [16]. Yang et al. [17] built a sequence generation model (SGM) with a
novel decoder structure through LSTM to conduct a multi-label classification task. Besides, LSTM with an attention mechanism has been successfully used in text classification, such as Attn-LSTM [18] and the hierarchical attention network (HAN) [19]. However, these LSTM-based models are very time-consuming in training, as LSTM processes one token per step. Moreover, LSTM with an attention mechanism involves the exponential function and normalization of all alignment scores of all the words in a text, which causes an extra computational burden [20].

As another dominant neural network, convolutional neural networks (CNNs) [21] are more efficient and versatile because they can be well-parallelized on graphics processing units (GPUs). Kim [22] proposed a strong baseline for sentence classification through some simple CNN structures and achieved better results by using pre-trained word vectors rather than random initialization. Many improved model structures [23], [24] are based on it. Liu et al. [23] adopted a dynamic max-pooling scheme instead of max over time pooling and added an additional fully connected layer, tailored for multi-label classification in particular, but its scheme is sophisticated. Zhang et al. [24] provided a CNN architecture entirely based on character-level information. The model can achieve state-of-the-art results only when a training dataset is large enough because it ignores the word-level meaning required to cope with small-scale datasets. Besides, a simple combination of LSTM and CNN performs well in sentiment analysis [25]. Moreover, reasonable data augmentation techniques are helpful to improve the performance of the model [26]. Other novel neural network methods, such as the one in [27], explore capsule networks with dynamic routing for text classification.

While most of these architectures have achieved excellent performance, they have some limitations. Complex architectures, e.g., those in [17], are challenging to train, sensitive to hyperparameters, and even slow to predict, making it harder to deploy such networks to solve real-world problems. Due to high time dependence (e.g., the state at the current time step depends on the previous one), LSTM-based models (such as [16]) are hardly parallelized. They may be biased when the following words are more influential than the earlier ones. Although CNN-based models (such as [22]) are simple and can be easily parallelized, they cannot discriminate whether context information should be carried or not, which leads to relatively low accuracy.

In order to tackle the problems discussed earlier, we construct a dynamic embedding projection gate (DEPG) layer after a word-embedding matrix for transforming and carrying word information by using gating units and shortcut connections to control how much context information is incorporated into each specific position of a word-embedding matrix in a text. Inspired by the highway network [28], we use DEPG to replace the affine transformation of the highway network with a convolution transformation to obtain the genuinely required context information. We conduct a series of experiments with the dynamic embedding projection-gated convolutional neural network (DEP-CNN), showing that our method outperforms other strong baselines on two multi-class and two multi-label text classification benchmarks. Besides, DEPG is verified to be necessary for capturing meaningful features given a text according to a visualization experiment. The novelty and main contributions of this article are as follows.
1) Based on the highway network, we construct a dynamic embedding projection-gated layer, in which DEPG replaces the affine transformation of the highway network with a convolution transformation.
2) This work represents the first work to apply DEPG over a word-embedding matrix such that gating units and shortcut connections are well-used to control how much context information is incorporated into each specific position of a word-embedding matrix given a text.
3) The results show that the proposed method outperforms its peers via several existing well-known benchmarks.
An overview of the related work is given in Section II. Section III details DEP-CNN. Section IV reports the experimental results. Finally, Section V concludes this work.

II. RELATED WORK

A. Gating Units

For a long time, people have studied the practices and theories of gating units [29]–[31]. Hochreiter and Schmidhuber [29] first applied them to recurrent neural networks (RNNs) [32], constituting a novel architecture called LSTM that can effectively alleviate vanishing or exploding gradient problems. RNN with gating units can selectively forget or retain the information of the previous time steps, playing an important role for RNN to reach state-of-the-art performance [16], [33], [34].

Gating units are also used in the field of CNNs and have been proved to be effective [30], [31]. The work [30] applies gated tanh units (GTUs) to a CNN-based model for exploring conditional image generation, where the GTU controls the features of the generated image pixels. Another popular method, called gated CNN [31], uses gated linear units (GLUs) instead of GTUs to train a language model. GLU can mitigate the vanishing or exploding gradient problem for many stacked convolutional blocks by providing a linear path for the gradients while retaining nonlinearity. However, a hierarchical CNN structure with gating units is limited in capturing the contextual information of a word within the whole text because its text structure depends on a feedforward hierarchical path.

B. Shortcut Connections

Shortcut connections refer to those in which input signals skip over one or more weight layers to accelerate the learning speed of neural network systems [35]. Their research has a long history since the idea was given in [36]. Decentralization is one of the characteristics of shortcut connections. Schraudolph [37] extended this idea to the reverse propagation of gradients and pointed out that not only the activation values of input and hidden layer units should be centralized but also the gradient errors and weight updates can be centralized, which is a shortcut connection between inputs and outputs.

From the above discussion, we realize that gating units have an excellent ability to control the information flow, and shortcut connections play a vital role in easing the problem of
vanishing or exploding gradients. Considering the advantages of gating units and shortcut connections, Srivastava et al. [28] proposed a structure called the highway network, allowing for computation paths along which the information can flow across many layers without attenuation. Srivastava et al. [38] further expanded the highway network to train deeper neural networks, gaining high performance on some image datasets. As the highway network has shown good results in the field of computer vision, we try to migrate it to the field of NLP to see whether it can improve the performance of a CNN model. Kim et al. [39] proved that applying a highway network to a language model can achieve outstanding results. In their language model, they first employ CNN to extract text information, then use its output as the input to the highway network, and finally apply LSTM to obtain the results. Since our task is multi-class and multi-label text classification rather than a language model, we have made some changes to the highway network. On the one hand, we replace the affine transformation with a convolution one in the highway network to obtain the genuinely required context information. On the other hand, we change the location of the highway network by applying it to a word-embedding matrix to retrieve pre-trained vocabulary information.

We call the modified highway network DEPG, where "Dynamic" means the continuous update of projection gate parameters during training and "Embedding" means a word-embedding matrix. In a word, DEPG can make the whole model capable of selectively controlling the flow of context information for shallow networks in NLP tasks. It is worth mentioning that this is the first work to propose and apply DEPG over a word-embedding matrix.

III. DEP-CNN MODEL

In this section, we propose a novel text classification architecture, namely, DEP-CNN, combining the advantages of the methods in [22] and [38]. DEP-CNN is more efficient than models based on LSTMs. In comparison with standard CNNs, it can capture the context meaning of words in the whole text better.

Fig. 1. Illustration of the DEP-CNN model architecture for ultimately assigning one of two labels to the given text in multi-class classification. Gc and Gt refer to the carry gate and transform gate, respectively.

A. Standard CNN

We first briefly explain the elements in CNN [22] as preliminaries for our study. The right of Fig. 1 shows the main architecture of a standard CNN. For ease of understanding, we use a single example to describe it in detail.

Let a text with length n (e.g., n = 5 in Fig. 1) be expressed as s = [w_1, . . . , w_n]. Each word in it is mapped to its corresponding word vector via pre-trained word vectors, yielding X = [x_1, . . . , x_n], x_i ∈ R^e, where e is the embedding size (e.g., e = 3 in Fig. 1). Kim [22] used a standard padding operation to adjust an input text to the maximum length of all texts, which ultimately affects the performance of a model due to the introduction of severe input noise. Instead, we dynamically adjust the text of each batch to a fixed length to alleviate this undesirable phenomenon. Given an input word-embedding matrix X, 1-D convolution with l filters (e.g., l = 2 in Fig. 1) of different kernel sizes k (e.g., k_1 = 3, k_2 = 4, and k_3 = 5 in Fig. 1) is used to extract n-gram features at different granularities over X at each position individually. A convolutional filter W_c ∈ R^{k_1×e} is applied to a region of k_1 words to produce a single abstract feature c_i ∈ R

c_i = f(g(W_c ⊙ X_{i:i+k_1}) + b_c)    (1)
where b_c ∈ R is a bias term, ⊙ is the elementwise multiplication of two matrices, g(·) is the function summing up all elements of a matrix, and f(·) represents ReLU [40], a nonlinear activation function.

We slide the filter across the whole text to produce a feature map c = [c_1, c_2, . . . , c_{n−k_1+1}]^T ∈ R^{(n−k_1+1)×1}. Because there are l filters, the output of the convolution is C_{k_1} ∈ R^{(n−k_1+1)×l}. For each filter, we apply the max over time pooling operation over the corresponding feature map to obtain a fixed-size vector m ∈ R^{1×l}. Also, with the three kernel sizes (e.g., k_1 = 3, k_2 = 4, and k_3 = 5 in Fig. 1), we can get three fixed-size vectors and then concatenate them to form the final feature representation of an input text.
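As a concrete illustration of the convolution and max over time pooling just described, the following is a minimal PyTorch sketch. It is our own code, not the authors' released implementation; the class name, argument names, and default values (taken from the parameter setup reported later in this article) are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardCNN(nn.Module):
    def __init__(self, embed_size=300, num_filters=512, kernel_sizes=(3, 4, 5)):
        super().__init__()
        # One Conv1d per kernel size; in_channels equals the embedding size e.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_size, num_filters, k) for k in kernel_sizes]
        )

    def forward(self, x):
        # x: (batch, n, e) word-embedding matrix -> (batch, e, n) for Conv1d
        x = x.transpose(1, 2)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x))        # feature map, shape (batch, l, n - k + 1); cf. Eq. (1)
            m, _ = c.max(dim=2)        # max over time pooling -> (batch, l)
            feats.append(m)
        return torch.cat(feats, dim=1) # concatenated final feature representation

# usage sketch: features = StandardCNN()(torch.randn(2, 50, 300))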
B. DEPG

The left of Fig. 1 shows the mechanism of DEPG. We add it between input embedding matrix X and standard CNN to form our new architecture. DEPG is given as

G_t = σ(X W_proj + b_proj)    (2)
G_c = 1 − G_t    (3)

where G_t ∈ R^{n×e} represents the transform gate and G_c ∈ R^{n×e} represents the carry gate. The elements of weight matrix W_proj ∈ R^{e×e} are sampled uniformly from [−1/√e, 1/√e], while those of bias vector b_proj ∈ R^e are initialized to 0. σ(·) is a sigmoid function [41] mapping all input elements to a range of 0–1. Note that we have tried to add positional encoding representing word order information into a word-embedding matrix, but it reduces the performance of our model.

The context information is encoded by 1-D convolution with h filters (e.g., h = 3 in Fig. 1) of kernel size r (e.g., r = 3 in Fig. 1). The number of filters h should be consistent with embedding size e. We define context information for a specific word i as

t_i = [t_1, t_2, . . . , t_h]    (4)
t_j = f(g(W_j ⊙ X_{i:i+r}) + b_j),  j = 1, 2, . . . , h    (5)

where t_i ∈ R^{1×h} represents a word vector that integrates local context information, t_j ∈ R is the result of the convolution operation of the j-th convolutional filter W_j ∈ R^{r×e}, and b_j ∈ R is a bias term. Therefore, the converted word-embedding matrix is given as

T = [t_1, t_2, . . . , t_n]^T    (6)

where T ∈ R^{n×e}.

It is worth noting that we need to ensure that the size of the word-embedding matrix remains unchanged after the convolution operation in (5). Therefore, the number of paddings is

p = ((n − 1)s + k − n) / 2    (7)

where s defaults to 1 over here, denoting the stride of the 1-D convolution.

Finally, we combine the original word-embedding matrix X with the converted T through DEPG, reasonably choosing words and their local context information within the same layer to make adaptive adjustments for facilitating and regulating their fusion. The new word-embedding matrix E ∈ R^{n×e} is given as

E = T ⊙ G_t + X ⊙ G_c.    (8)

We replace X with E as input to standard CNN to obtain text feature representations. These features are next fed to a classifier that computes class probabilities.
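The following is a rough PyTorch sketch of a DEPG layer as we read Eqs. (2)–(8): a per-position linear projection gives the transform gate, the carry gate is its complement, local context comes from a 1-D convolution with e filters and the padding of Eq. (7), and the two are fused by the gates. This is our own illustrative code, not the authors' implementation, and the names are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DEPG(nn.Module):
    def __init__(self, embed_size=300, kernel_size=3):
        super().__init__()
        self.proj = nn.Linear(embed_size, embed_size)   # Wproj and bproj of Eq. (2)
        nn.init.uniform_(self.proj.weight, -embed_size ** -0.5, embed_size ** -0.5)
        nn.init.zeros_(self.proj.bias)
        # h = e filters so the embedding size is unchanged; padding follows Eq. (7) with s = 1
        self.context = nn.Conv1d(embed_size, embed_size, kernel_size,
                                 padding=(kernel_size - 1) // 2)

    def forward(self, x):
        # x: (batch, n, e) word-embedding matrix
        gt = torch.sigmoid(self.proj(x))                 # transform gate, Eq. (2)
        gc = 1.0 - gt                                    # carry gate, Eq. (3)
        t = F.relu(self.context(x.transpose(1, 2))).transpose(1, 2)  # context matrix T, Eqs. (4)-(6)
        return t * gt + x * gc                           # gated fusion E, Eq. (8)

# usage sketch: e_matrix = DEPG()(torch.randn(2, 50, 300))  # same shape as the input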
Algorithm 1 DEP-CNN for Multi-Class Text Classification
Input: Text with n words: s = [w_1, . . . , w_n]; pre-trained word-embedding matrix P for words in vocabulary set V; out-of-vocabulary set V_un; filter count h of DEPG; kernel size list k and filter count l of standard CNN.
Output: The category of text s.
1: Initialize each word in V_un using a random vector, and append the vector to P. Randomly initialize all the weight matrices W and biases b in DEP-CNN.
2: For i in range(n) do
3:     Get w_i's word vector x_i from the pre-trained word-embedding matrix P to form the input word-embedding matrix X.
4: End For
5: i = 0
6: While i < n do
7:     For j in range(h) do
8:         Calculate word w_i's j-th dimensional transform gate G_t^{ij} and carry gate G_c^{ij}.
9:         Pad X to keep its size unchanged.
10:        Calculate the converted information T_{ij} for word w_i, using a convolutional operation with the j-th filter.
11:        Get word w_i's j-th dimensional context feature E_{ij} by calculating T_{ij} ⊙ G_t^{ij} + X_{ij} ⊙ G_c^{ij} ∈ R.
12:    End For
13:    Obtain w_i's context representation E_i = [. . . , E_{ij}, . . .]^T.
14:    i = i + 1
15: End While
16: Obtain the context word-embedding matrix E = [. . . , E_i, . . .]^T.
17: For k in k do
18:     Get the feature map C_k ∈ R^{(n−k+1)×l} of the k-th kernel size with all l filters after the convolutional operation on E.
19:     Capture the most important information m_k ∈ R^{1×l} of C_k by a max pooling operation.
20: End For
21: Concatenate the outputs of the pooling layer to get [. . . , m_k, . . .], which is then fed to the dropout layer and next fed to the fully connected layer to get the probability distribution p.
22: Return the index corresponding to the maximum value of p.

C. Loss Function

For multi-label text classification task, our model is trained by minimizing the binary cross-entropy objective over a
sigmoid activation function. Also, for multi-class text classification task, it is trained by minimizing the cross-entropy objective over a softmax activation function. Both objectives are minimized between the ground truth and the estimated probability for all training samples. They can be formulated as

min_θ −(1/m) Σ_{i=1}^{m} Σ_{j=1}^{c} [ y_{ij} log σ(z_{ij}) + (1 − y_{ij}) log(1 − σ(z_{ij})) ]    (9)

min_θ −(1/m) Σ_{i=1}^{m} Σ_{j=1}^{c} y_{ij} log ( exp(z_{ij}) / Σ_{k=1}^{c} exp(z_{ik}) )    (10)

where i is the index of a training sample, j is the index of a class label, z_{ij} is the output element of the final fully connected layer, and y_{ij} is the ground truth of the i-th training sample for the j-th class label. We give DEP-CNN for multi-class text classification in Algorithm 1.
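As a hedged illustration of these two objectives, the PyTorch losses below correspond to Eq. (9) (binary cross-entropy over sigmoid outputs for multi-label data) and Eq. (10) (cross-entropy over a softmax for multi-class data), up to the library's default averaging convention; the tensors are made-up placeholders and the variable names are ours.

import torch
import torch.nn as nn

z = torch.randn(4, 10)                                   # (m samples, c classes) logits z_ij

# multi-label: y is a {0,1} indicator matrix of shape (m, c)
y_multilabel = torch.randint(0, 2, (4, 10)).float()
loss_multilabel = nn.BCEWithLogitsLoss()(z, y_multilabel)   # Eq. (9)

# multi-class: y holds one class index per sample
y_multiclass = torch.randint(0, 10, (4,))
loss_multiclass = nn.CrossEntropyLoss()(z, y_multiclass)    # Eq. (10)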
IV. EXPERIMENTS AND RESULTS

A. Datasets

The experiments are conducted on four standard benchmark datasets to test the performance of our method.
1) IMDB (Internet Movie Multi-Class Dataset for Predicting Movie Rating Given a Sample): We use the splits from [16].
2) AG News (News Topic Multi-Class Dataset): We randomly assign 10% of the training set as a validation set [24].
3) AAPD (ArXiv Academic Paper Multi-Label Dataset): It contains abstracts and topics from 55 840 academic papers in the field of science with the aim to predict the relevant topics of an academic paper based on its abstract. The author-defined splits [17] are used in this work.
4) Reuters-21578 (Financial Newswire Service Multi-Label Dataset With 10 788 Documents): The standard splits in [42] are used in this article.
The statistics for each dataset are shown in Table I.

TABLE I: Summary of the datasets after tokenization. Label type: multi-class indicates only one category label; multi-label indicates more than one relevant label. Train/Val./Test: number of samples for train, validation, and test, respectively. C: number of target categories. L: average text length.

B. Parameter Setup

In the experiment, we initialize word embeddings by using publicly available word vectors with 300 dimensions pre-trained on Google News proposed in [13], and the words not presented in this pre-trained set are initialized uniformly from [−0.25, 0.25]. The word embeddings are fine-tuned during training on all datasets except Reuters since CUDA runs out of memory. Our model uses the Adam optimization algorithm as a stochastic gradient descent update rule with a learning rate of 1e−3 and a batch size of 64 and 32 for multi-class and multi-label text classification, respectively. We also add temporal averaging (such as [16]) to reduce the generalization error and random noise of the model [43]. We set the convolutional filter size in the dynamic embedding projection layer to 3 with 300 feature maps and use the filter kernel sizes of 3, 4, and 5 with 512 feature maps for the standard CNN. A dropout with rate of 0.5 [44] is used over the penultimate layer for alleviating overfitting. We implement our experiments on an RTX 2080Ti GPU with PyTorch 0.4.1. For a fair comparison, all experiments are conducted in the same deep learning platform as the work [16]. LR and SVM use the default setting in Scikit-learn¹ for getting the accuracy on AG News.

¹https://scikit-learn.org/stable/index.html
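A minimal sketch of this optimization setup, under the assumptions stated above (Adam with learning rate 1e−3, dropout 0.5 before the final fully connected layer), is given below. The stand-in classifier head and tensor shapes are ours and only illustrate one training step.

import torch
import torch.nn as nn

# stand-in head: dropout 0.5 over the penultimate layer, then the classifier
model = nn.Sequential(nn.Dropout(0.5), nn.Linear(3 * 512, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(64, 3 * 512)      # a batch (size 64, multi-class setting) of pooled text features
labels = torch.randint(0, 4, (64,))

optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()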
model after five runs with different seeds on IMDB, AG
In the experiment, we initialize word embeddings by using
News, AAPD, and Reuters datasets are 0.2, 0.1, 0.2, and 0.2,
publicly available word vectors with 300 dimensions pre-
trained on Google News proposed in [13], and the words not 1 https://scikit-learn.org/stable/index.html

Authorized licensed use limited to: Universita degli Studi di Roma Tor Vergata. Downloaded on January 01,2024 at 11:21:43 UTC from IEEE Xplore. Restrictions apply.
978 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 33, NO. 3, MARCH 2022

respectively, which are the smallest in comparison with other deep learning models.

TABLE II: Results of our DEP-CNN against other models on the validation and test sets. Acc. (accuracy) and F1 (F1 score) represent the measurement of multi-class and multi-label, respectively. M ± SD denotes the mean and SD for five runs with different seeds. * refers to the results from [16], & refers to our extra experimental results using [16] according to the original paper, $ denotes the results from the original paper, and # denotes the results from our reimplementations according to the original paper.

Interestingly, on the AAPD and Reuters datasets, the performance of traditional SVM appears better than other strong neural net models (e.g., KimCNN, Attn-LSTM, and HAN). It is more suitable for multi-label classification compared with the LR baseline. On the other hand, although LR outperforms SVM on IMDB, it has lower test scores by up to 0.5% compared with SVM on AG News, which contradicts the conclusion in [16] that LR appears suited for multi-class datasets. The reason for this may be that our selected function in Scikit-learn differs from that in [16] or the statistical information of our tested datasets varies.

KimCNN has the worst performance among all neural networks on IMDB and AAPD because of its weak capability of capturing a long text. In addition to DEP-CNN, the complex SGM achieves the best performance on AAPD. Also, on AG News, Capsule-Net defeats all the methods except for DEP-CNN. It is worth noting that Reg-LSTM has very high scores on IMDB and Reuters. All the results of our model are close to 52.8% and 87.0% on the test sets of IMDB and Reuters, respectively, and the best result is slightly higher than Reg-LSTM by 0.2%. Nevertheless, the convergence time of our model on all datasets is shorter than Reg-LSTM, and especially on Reuters, our model takes less than half of its time.

TABLE III: P-values of DEP-CNN against other models under t-test.

TABLE IV: Time to converge in seconds on four text classification datasets for all neural methods except SGM and Capsule-Net. These methods are trained in the same experimental environment. # denotes CUDA out of memory.

P-values of DEP-CNN against other models under t-test are shown in Table III. In this experiment, we set the degree of freedom to 4 and α = 0.05 to conduct the two-sided test. According to Table III, the improvement brought by DEP-CNN over its compared models is statistically significant with p < 0.05 under the t-test except for Reg-LSTM on IMDB and Reuters, and LSTM-CNN on AAPD and Reuters.
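As a sketch of this kind of significance check: with five runs per model, a paired two-sided t-test over the per-seed scores has 4 degrees of freedom, matching the setting above. The code assumes a paired test and uses made-up placeholder scores, not results from this article.

from scipy import stats

dep_cnn_runs = [0.921, 0.923, 0.920, 0.922, 0.921]   # placeholder per-seed scores
baseline_runs = [0.915, 0.917, 0.914, 0.916, 0.915]

t_stat, p_value = stats.ttest_rel(dep_cnn_runs, baseline_runs)  # paired, df = 4, two-sided
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")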
The training convergence time of the neural methods on two multi-class and two multi-label datasets is recorded in Table IV. We fail to migrate the authors' code to the same experimental environment to report the convergence time of SGM and Capsule-Net. From this table, we can see that DEP-CNN has the shortest training time on the AG News dataset and ranks second, second, and third in IMDB, AAPD, and Reuters, respectively. In these three datasets, DEP-CNN is worse than KimCNN in terms of training time. However, according to the
previous analysis, we know that KimCNN is far from us in scores. We also find that CNN-based models except for Char-CNN are better than LSTM-based models in terms of training time as a result of the parallel mechanism of CNN, and the reason why Char-CNN costs much time to converge is that this model needs to start with character-level training from scratch. Surprisingly, the convergence rate of the sophisticated HAN on multi-label datasets is faster than that of the simple Reg-LSTM. Moreover, Attn-LSTM achieves a competitive result of training time with DEP-CNN on AG News and Reuters.

D. Ablation Study

In order to analyze the effect of DEPG for text classification, the results of DEP-CNN implemented without DEPG are provided in Table V. Compared with DEP-CNN, the validation scores of DEP-CNN without DEPG decrease by 1.2%, 1.0%, 1.2%, and 0.8%, respectively, on the IMDB, AG News, AAPD, and Reuters datasets and by 1.2%, 0.5%, 0.9%, and 0.8%, respectively, in test scores. In a word, DEPG has brought improvements to varying degrees on these standard datasets, especially on IMDB, where the test scores have increased by more than 1%.

We have also studied the effect of DEPG during the training process. Fig. 2 shows the test scores of DEP-CNN and DEP-CNN without DEPG, where there are two curves on each dataset and all the models use the same experimental settings. Observe that the test scores of each model gradually converge to a stable interval with the increase of time steps. In addition, DEP-CNN takes less time to converge on IMDB, AAPD, and Reuters, which indicates that DEPG plays a role in alleviating gradient-based training problems. However, for the AG News dataset, the convergence time of DEP-CNN without DEPG is almost the same as that of DEP-CNN. We think that the number of categories may lead to this because, as it increases (according to Table I, the number of categories in AG News is 4, while those in the other datasets are 10, 54, and 90), the text classification task becomes more difficult, making the effect of DEPG more obvious.

Fig. 2. Study on the influence of DEPG. Accuracy measures the multi-class text classification. F1 score measures the multi-label text classification. (a) Multi-class datasets. (b) Multi-label datasets.

E. Study of Various Combinations

As far as we know, most of the current works place a highway network over CNN [28], [38], [39], [46]. We conjecture that changing the order of their positions may improve classification scores. Therefore, to verify it, we have carried out a series of experiments with results shown in Table VI. We can see that HW-CNN has almost the same results as CNN-HW on IMDB and AG News but is better than it on AAPD and Reuters. It is worth mentioning that this is the first work to apply a highway network over a word-embedding matrix and then feed the outputs to CNN. In addition, considering that a convolution transformation can obtain the local context information of a word-embedding matrix, we try to replace the affine transformation with a convolution one in a highway network. The modified highway network is named DEPG by us. We find that DEP-CNN is better than CNN-DEP on all datasets, which illustrates the effectiveness of placing DEPG over a word-embedding matrix.

In Rows 2 and 4 of Table VI, DEP-CNN outperforms HW-CNN on IMDB, AG News, and AAPD but behaves slightly worse than HW-CNN on Reuters. Why does DEP-CNN underperform HW-CNN on Reuters? We reason that the number of categories per word in the average Reuters sample is greater (according to Table I, it is 0.70, while those in the other datasets are 0.03, 0.11, and 0.33), signifying that the work on Reuters does not need much context information compared with that on the other datasets. In short, except for CNN-HW, the other three models are all constructed by us for the first time. Considering the performance of CNN-DEP, DEP-CNN, and HW-CNN, we take DEP-CNN as the final model.

F. Visualization

In this section, we visualize the input word-embedding matrix, the weights generated by the DEPG layer, and the feature maps of the standard CNN layer after the activation of ReLU. According to [47], we randomly chose a slightly shorter text from the AG News test set as a sample and input the text into the trained DEP-CNN model. We train a small model with k = 3 in the dynamic embedding projection layer and r = 3 in the standard CNN layer. The input text is "Icing call out of money out of patience out of time and for the foreseeable future out of business" and its category is "Business."

After dimension reduction and normalization, the value of each word contained in the text described earlier is visualized in Fig. 3. Surprisingly, DEPG can selectively extract words and their context information in the same layer by observing Fig. 3(a) and (b), i.e., it suppresses insignificant word information (e.g., "and," "for," and "the") and retains important contextual information (e.g., "Icing call out," "out of money," and "out of business") at the same time. Also, the vital features
in Fig. 3(b) are enhanced after passing through the standard CNN layer. Ultimately, the model focuses on three areas: "Icing," "of money out of patience," and "foreseeable future of business." What puzzles us here is why the model gives much attention to the word "Icing," which is not related to business. At first, we think that it is the interference of noise, but we find that Icing is the name of a shopping website by consulting the information. In conclusion, we believe that DEPG is very helpful in capturing more meaningful features for a given text.

Fig. 3. Visualization of the DEP-CNN model. (a) Input word-embedding matrix. (b) Weights generated by the DEPG layer. (c) Feature maps of the standard CNN layer after the ReLU activation function.

TABLE V: Results of the ablation study for DEP-CNN. "–DEPG" refers to our model implemented without DEPG.

TABLE VI: Results of the combination study for CNN-HW, HW-CNN, CNN-DEP, and DEP-CNN. The combined model of CNN and highway network is named CNN-HW. HW-CNN, CNN-DEP, and DEP-CNN are similarly named.

G. Effect of Pre-Trained Word Vectors

In the process of model initialization, the input word-embedding matrix of a neural network model can be obtained by looking up the pre-trained word vectors to represent the given text words, which has become the most common operation. To explore the influence of pre-trained word vectors on the performance of our DEP-CNN, we experiment with GloVe [14], Word2vec [13], fastText [15], and Rand as distributed representations of words for a given text. For GloVe, Word2vec, and fastText, we use the vectors learned on 840B tokens of web data, 100B tokens of Google News, and 600B tokens of Common Crawl, respectively. Rand represents random initialization, not using pre-trained word vectors. The dimension of these vectors is 300. Test scores of DEP-CNN on different pre-trained word vectors are shown in Fig. 4.

Compared with the random initialization, the pre-trained word vectors improve the test scores of our DEP-CNN by 0.82%, 0.61%, 0.83%, and 0.76% on IMDB, AG News, AAPD, and Reuters, respectively, which shows that pre-trained word vectors are indeed helpful to some extent. Therefore, it is a favorable choice to use pre-trained word vectors instead of random initialization. Besides, the relative test scores obtained by employing GloVe, Word2vec, and fastText rely on datasets, and trying several different pre-trained word vectors may get the better result to a certain extent. However, the results differ by less than 0.25% on these three pre-trained word vectors, which implies that DEP-CNN has high robustness.

Fig. 4. Test scores of DEP-CNN on different pre-trained word vectors. Accuracy measures the multi-class text classification. F1 score measures the multi-label text classification. (a) Multi-class datasets. (b) Multi-label datasets.

V. CONCLUSION

In this article, we introduce an efficient neural network model with gating units and shortcut connections, named DEP-CNN, for multi-class and multi-label text classification. Without dependence on lexical resources or NLP toolkits, its DEPG can selectively control how much context information is incorporated into each specific position by converting a word-embedding matrix and delivering it. We prove the effectiveness of DEP-CNN by comparing it with other up-to-date methods via extensive experiments on four known benchmark datasets. In addition, a visualization experiment shows that DEPG can indeed learn to acquire more meaningful features for a given text. According to the test scores of DEP-CNN on different pre-trained word vectors, we also observe that our method exhibits high robustness. Our future work includes how to employ a DEPG layer in other neural networks [48]–[52] and how to add additional information derived from internal or external sources to improve the performance of the model.

REFERENCES

[1] K. Kowsari, K. J. Meimandi, M. Heidarysafa, S. Mendu, L. E. Barnes, and D. E. Brown, "Text classification algorithms: A survey," 2019, arXiv:1904.08067. [Online]. Available: http://arxiv.org/abs/1904.08067
[2] S. I. Wang and D. C. Manning, "Baselines and bigrams: Simple, good sentiment and topic classification," in Proc. 50th Annu. Meeting Assoc. Computat. Linguistics (ACL), 2012, pp. 90–94.
[3] A. Nigam, P. Sahare, and K. Pandya, "Intent detection and slots prompt in a closed-domain chatbot," in Proc. IEEE 13th Int. Conf. Semantic Computat. (ICSC), Newport Beach, CA, USA, Jan. 2019, pp. 340–343.
[4] X. Kang, F. Ren, and Y. Wu, "Exploring latent semantic information for textual emotion recognition in blog articles," IEEE/CAA J. Automat. Sinica, vol. 5, no. 1, pp. 204–216, Jan. 2018.
[5] X. Wang, Q. Kang, J. An, and M. Zhou, "Drifted Twitter spam classification using multiscale detection test on K-L divergence," IEEE Access, vol. 7, pp. 108384–108394, 2019.
[6] A. McCallum, "Information extraction: Distilling structured data from unstructured text," Queue, vol. 3, no. 9, pp. 48–57, 2005.
[7] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proc. Eur. Conf. Mach. Learn., 1998, pp. 137–142.
[8] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in Proc. AAAI-98 Workshop Learn. Text Categorization, 1998, pp. 41–48.
[9] B. Jiang, Z. Li, H. Chen, and A. G. Cohn, "Latent topic text representation learning on statistical manifolds," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11, pp. 5643–5654, Nov. 2018.
[10] J. Silva, L. Coheur, A. C. Mendes, and A. Wichert, "From symbolic to sub-symbolic information in question classification," Artif. Intell. Rev., vol. 35, no. 2, pp. 137–154, Feb. 2011.
[11] Q. Kang, L. Shi, M. Zhou, X. Wang, Q. Wu, and Z. Wei, "A distance-based weighted undersampling scheme for support vector machines and its application to imbalanced classification," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 9, pp. 4152–4165, Sep. 2018.
[12] W. Liang, R. Feng, X. Liu, Y. Li, and X. Zhang, "GLTM: A global and local word embedding-based topic model for short texts," IEEE Access, vol. 6, pp. 43612–43621, 2018.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., vol. 26, 2013, pp. 3111–3119.
[14] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, Oct. 2014, pp. 1532–1543.
[15] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," 2016, arXiv:1607.04606. [Online]. Available: http://arxiv.org/abs/1607.04606
[16] A. Adhikari, A. Ram, R. Tang, and J. Lin, "Rethinking complex neural network architectures for document classification," in Proc. NAACL-HLT, vol. 1, 2019, pp. 4046–4051.
[17] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, "SGM: Sequence generation model for multi-label classification," in Proc. 27th Int. Conf. Computat. Linguistics, 2018, pp. 3915–3926.
[18] P. Zhou et al., "Attention-based bidirectional long short-term memory networks for relation classification," in Proc. 54th Annu. Meeting Assoc. Computat. Linguistics, vol. 2, 2016, pp. 207–212.
[19] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proc. Conf. North Amer. Chapter Assoc. Computat. Linguistics: Hum. Lang. Technol., 2016, pp. 1480–1489.
[20] Y. Wang, M. Huang, X. Zhu, and L. Zhao, "Attention-based LSTM for aspect-level sentiment classification," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2016, pp. 606–615.
[21] Y. Xia, H. Yu, and F.-Y. Wang, "Accurate and robust eye center localization via fully convolutional networks," IEEE/CAA J. Automat. Sinica, vol. 6, no. 5, pp. 1127–1138, Sep. 2019.
[22] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1746–1751.
[23] J. Liu, W. C. Chang, Y. Wu, and Y. Yang, "Deep learning for extreme multi-label text classification," in Proc. 40th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Aug. 2017, pp. 115–124.
[24] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 649–657.
[25] P. M. Sosa, "Twitter sentiment analysis using combined LSTM-CNN models," Zugriff Am, vol. 10, p. 2019, Jun. 2017.
[26] D. Zhang and Z. Yang, "Word embedding perturbation for sentence classification," 2018, arXiv:1804.08166. [Online]. Available: http://arxiv.org/abs/1804.08166
[27] W. Zhao, J. Ye, M. Yang, Z. Lei, S. Zhang, and Z. Zhao, "Investigating capsule networks with dynamic routing for text classification," 2018, arXiv:1804.00538. [Online]. Available: http://arxiv.org/abs/1804.00538
[28] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," 2015, arXiv:1505.00387. [Online]. Available: http://arxiv.org/abs/1505.00387
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computat., vol. 9, no. 8, pp. 1735–1780, 1997.
[30] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional image generation with PixelCNN decoders," 2016, arXiv:1606.05328. [Online]. Available: http://arxiv.org/abs/1606.05328

[31] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," 2016, arXiv:1612.08083. [Online]. Available: http://arxiv.org/abs/1612.08083
[32] K.-I. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Netw., vol. 6, no. 6, pp. 801–806, Jan. 1993.
[33] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, "Exploring the limits of language modeling," 2016, arXiv:1602.02410. [Online]. Available: http://arxiv.org/abs/1602.02410
[34] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555. [Online]. Available: http://arxiv.org/abs/1412.3555
[35] L. Xu, "An overview and perspectives on bidirectional intelligence: Lmser duality, double IA harmony, and causal computation," IEEE/CAA J. Automat. Sinica, vol. 6, no. 4, pp. 865–893, Jul. 2019.
[36] C. M. Bishop, Neural Networks for Pattern Recognition. London, U.K.: Oxford Univ. Press, 1995.
[37] N. Schraudolph, "Accelerated gradient descent by factor-centering decomposition," IDSIA, Lugano, Switzerland, Tech. Rep. IDSIA 98, 1998.
[38] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," 2015, arXiv:1507.06228. [Online]. Available: http://arxiv.org/abs/1507.06228
[39] Y. Kim, Y. Jernite, D. Sontag, and M. A. Rush, "Character-aware neural language models," in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 2741–2749.
[40] N. Vinod and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn., 2010, pp. 807–814.
[41] S. Grossberg, Studies of Mind and Brain. Springer, 1982.
[42] C. Apté, F. Damerau, and S. M. Weiss, "Automated learning of decision rules for text categorization," ACM Trans. Inf. Syst., vol. 12, no. 3, pp. 233–251, Jul. 1994.
[43] P. D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
[44] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," 2012, arXiv:1207.0580. [Online]. Available: http://arxiv.org/abs/1207.0580
[45] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805. [Online]. Available: http://arxiv.org/abs/1810.04805
[46] M. E. Peters et al., "Deep contextualized word representations," 2018, arXiv:1802.05365. [Online]. Available: http://arxiv.org/abs/1802.05365
[47] Y. Liu, L. Ji, R. Huang, T. Ming, C. Gao, and J. Zhang, "An attention-gated convolutional neural network for sentence classification," Intell. Data Anal., vol. 23, no. 5, pp. 1091–1107, Oct. 2019.
[48] S. Gao, M. Zhou, Y. Wang, J. Cheng, H. Yachi, and J. Wang, "Dendritic neuron model with effective learning algorithms for classification, approximation, and prediction," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 2, pp. 601–614, Feb. 2019.
[49] Z. Huang, X. Xu, H. Zhu, and M. Zhou, "An efficient group recommendation model with multiattention-based neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4461–4474, Nov. 2020.
[50] G. Cai, Y. Wang, L. He, and M. Zhou, "Unsupervised domain adaptation with adversarial residual transform networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 8, pp. 3073–3086, Aug. 2020.
[51] M. Ghahramani, Y. Qiao, M. Zhou, A. O'Hagan, and J. Sweeney, "AI-based modeling and data-driven evaluation for smart manufacturing processes," IEEE/CAA J. Autom. Sinica, vol. 7, no. 4, pp. 1026–1037, Jul. 2020.
[52] Q. Kang, S. Yao, M. Zhou, K. Zhang, and A. Abusorrah, "Enhanced subspace distribution matching for fast visual domain adaptation," IEEE Trans. Computat. Social Syst., vol. 7, no. 4, pp. 1047–1057, Aug. 2020.

Zhipeng Tan received the B.S. degree in electrical engineering and automation from Shanghai University, Shanghai, China, in 2018. She is currently pursuing the M.S. degree in control science and engineering with Tongji University, Shanghai. Her research focuses on the use of deep learning technology for natural language processing tasks.

Jing Chen received the B.S. degree in measurement and control technology and instrument from China Jiliang University, Hangzhou, China, in 2018. He is currently pursuing the M.S. degree in control engineering with Tongji University, Shanghai, China. His research focuses on the use of deep learning technology for natural language processing tasks.

Qi Kang (Senior Member, IEEE) received the B.S. degree in automatic control, the M.S. degree in control theory and control engineering, and the Ph.D. degree in control theory and control engineering from Tongji University, Shanghai, China, in 2002, 2005, and 2009, respectively. From 2007 to 2008, he was a Research Associate with the University of Illinois, Chicago, IL, USA. From 2014 to 2015, he was a Visiting Scholar with the New Jersey Institute of Technology, Newark, NJ, USA. He is currently a Professor with the Department of Control Science and Engineering and the Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, China. His interests are in swarm intelligence, evolutionary computation, machine learning, and intelligent control and optimization in transportation, energy, and water systems.

MengChu Zhou (Fellow, IEEE) joined the New Jersey Institute of Technology (NJIT), Newark, NJ, USA, in 1990, where he is currently a Distinguished Professor. He has over 900 publications, including 12 books, over 600 journal articles (over 500 in IEEE TRANSACTIONS), 28 patents, and 29 book chapters. His interests are in Petri nets, intelligent automation, the Internet of Things, and big data. Prof. Zhou is a fellow of the International Federation of Automatic Control, the American Association for the Advancement of Science, the Chinese Association of Automation, and the National Academy of Inventors (NAI). He was a recipient of the Humboldt Research Award for U.S. Senior Scientists from the Alexander von Humboldt Foundation, the Franklin V. Taylor Memorial Award and the Norbert Wiener Award from the IEEE Systems, Man and Cybernetics Society, the Excellence in Research Prize and Medal from NJIT, and the Edison Patent Award from the Research and Development Council of New Jersey.

Abdullah Abusorrah (Senior Member, IEEE) received the Ph.D. degree in electrical engineering from the University of Nottingham, Nottingham, U.K., in 2007. He is currently a Professor with the Department of Electrical and Computer Engineering, King Abdulaziz University, Jeddah, Saudi Arabia. He is also the Head of the Center for Renewable Energy and Power Systems, King Abdulaziz University. His interests are energy systems, smart grid, and system analyses.

Khaled Sedraoui received the B.S. degree from the Institute of Technology and Sciences Tunis (ESSTT), Tunis, Tunisia, in 1989, the M.Sc. degree from the École de Technologie Supérieure, Quebec University, Montreal, QC, Canada, in 1994, and the Ph.D. degree from ESSTT in 2010, all in electrical engineering. He is currently an Associate Professor of electrical and computer engineering with the College of Engineering, King Abdulaziz University, Jeddah. His interests include renewable energies integration and power quality improvement, design and control strategy for flexible ac transmission systems, electric energy storage for renewable energy systems and electric vehicles, and optimization for smart grid.
