Review of Text Classification Methods on Deep Learning
Abstract: Text classification has long been a crucial topic in natural language processing. Traditional text classification methods based on machine learning have many disadvantages, such as dimension explosion, data sparsity and limited generalization ability. Focusing on deep-learning-based text classification, this paper presents an extensive study of text classification models, including Convolutional Neural Network-based (CNN-based), Recurrent Neural Network-based (RNN-based) and Attention Mechanism-based models. Many studies have shown that text classification methods based on deep learning outperform traditional methods when processing large-scale and complex datasets, mainly because they avoid the cumbersome feature extraction process and achieve higher prediction accuracy on large sets of unstructured data. We also summarize the shortcomings of traditional text classification methods and introduce the deep-learning-based text classification process, including text preprocessing, distributed representation of text, construction of classification models based on deep learning, and performance evaluation.
1 Introduction
With the rapid development of the internet and big data, internet information has grown explosively. As an efficient information retrieval and mining technology, text classification has attracted extensive attention from natural language processing researchers and plays an important role in the management of internet information. Text classification can effectively extract valuable information from massive data and classify it automatically by using natural language processing, text mining, machine learning and deep learning techniques. At present, common text classification applications include sentiment analysis [Tang, Qin, Wei et al. (2015)], news text classification [Li, Shang and Yan (2016)], and topic classification.
Text classification mainly includes binary classification, multi-class classification and multi-label classification, in which a text may belong to one or more categories simultaneously. The main task of text classification is to divide a given dataset of texts into one or more categories according to their content. If we want to build a text
classification system, we first need to divide the dataset into a training set and a test set; then we use the labeled training set to train the classification model, and finally we use this model to predict the categories of the test set.
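As a concrete illustration, the following minimal sketch builds such a system with scikit-learn; the toy corpus, labels and choice of classifier are placeholders for illustration only, not taken from any of the surveyed works.

# A minimal sketch of the split/train/predict workflow described above.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good movie", "boring plot", "great acting", "dull film"]  # toy data
labels = [1, 0, 1, 0]                                               # toy labels

# 1) divide the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

# 2) train the classification model on the labeled training set
vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(X_train), y_train)

# 3) use the model to predict the categories of the test set
predictions = clf.predict(vectorizer.transform(X_test))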
The major challenges in building a well-performing text classification system are text representation and the classification model. Existing machine learning methods can rarely process text data directly, while text representation makes text data mathematically computable by converting it into numerical data. Text representation is a mapping process that maps a word from text space to a numerical vector space by a certain method. Traditional text representations are discrete representations, such as one-hot encoding, Bag-of-Words [Kesorn and Poslad (2012)], TF-IDF [Soucy and Mineau (2005); Wu, Luk, Wong et al. (2008)] and N-grams. General natural language processing problems can be solved by discrete representations, but they cause many problems in scenarios requiring high accuracy, such as high dimensionality, sparse matrices and the loss of semantics. To improve model accuracy, researchers proposed the distributed representation of text, whose main idea is to map each word to a shorter dense vector through training. Common distributed representations include co-occurrence matrices and word embeddings. In neural network language models, word embeddings can capture the semantic relationships between words.
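For instance, a minimal sketch of a discrete Bag-of-Words representation with scikit-learn makes the dimensionality and sparsity problems visible (the three-document corpus is a toy placeholder):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]
bow = CountVectorizer().fit_transform(corpus)  # one dimension per vocabulary word
print(bow.shape)      # (3, 5): grows with the vocabulary -> dimension explosion
print(bow.toarray())  # mostly zeros on real corpora -> sparsity, and no semantics

A distributed representation instead maps each word to a short dense vector, as discussed above.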
Under the assumption that classification tasks obey a certain probability distribution, traditional text classification models use Bayes' theorem to obtain a classifier, but if this hypothesis does not hold, classification accuracy suffers. There are several traditional text classification models [Miao, Zhang, Jin et al. (2018)], such as the K-Nearest Neighbor algorithm [Xu and Liu (2008)], the Support Vector Machine algorithm [Wei, Guo, Yu et al. (2013)], the Bayesian classification algorithm [Gong and Yu (2010)], and the Decision Tree [Johnson, Oles, Zhang et al. (2002)]. However, traditional text classification methods based on machine learning have many defects, such as dimension explosion, data sparsity and limited generalization ability.
The concept of deep learning was first put forward in 2006 and has since made great breakthroughs in image recognition, speech recognition and other fields. In recent years, deep learning has also achieved the best results in text classification tasks. Compared with traditional machine learning algorithms, deep learning extracts more abstract features from the input by deepening the model, which makes the final classification decision more reliable: the deeper the model, the more reliable the abstract features it extracts. As a result, the number of model parameters increases geometrically, and so does the training time. However, with the advent of the era of big data, sufficient training samples can mitigate this risk.
Text classification models based on deep learning greatly alleviate the problems caused by traditional machine learning methods: they avoid the cumbersome feature extraction process and achieve higher prediction accuracy. Many such models have been explored. For example, Kim [Kim (2014)] first proposes applying the Convolutional Neural Network (CNN) to the text classification task and obtains excellent classification results, which also inspires the use of deep learning methods on more complex text structures. Later, the Recurrent Neural Network (RNN) also becomes more and more popular in text classification. Liu et al. [Liu, Qiu and Huang (2016)] introduce three text classification methods based on multi-task learning with RNNs. The Attention Mechanism [Yang, Yang, Dyer et al. (2016)] is also added to the RNN model, which alleviates the problem of long-term dependence in text and directly presents the importance of each word. Some researchers further combine the advantages of CNN and RNN by using them to extract global long-term dependencies and local semantic features, respectively.
This paper not only introduces the text classification process based on deep learning, but also focuses on the classical deep-learning-based classification models of recent years. Section 1 introduces the research background and significance of text classification, and analyzes the advantages and disadvantages of text representations and classification models based on traditional methods and on deep learning. Section 2 presents the text classification process based on deep learning, including text preprocessing, distributed representation of text, classification model construction and performance evaluation. Section 3 analyzes and summarizes the various text classification models based on deep learning, including CNN-based, RNN-based and Attention Mechanism-based models. Section 4 draws the conclusion.
Figure 1: The deep-learning-based text classification process: text preprocessing, distributed representation of text, classification model construction, performance evaluation, and output of the predicted category.
Common word embedding tools include GloVe [Jeffery, Richard and Christopher (2017)] of Stanford University and the Python library Gensim [Rare (2017)].
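As a minimal sketch, training word embeddings with Gensim might look as follows (Gensim 4.x API; the two-sentence corpus is a toy placeholder):

from gensim.models import Word2Vec

sentences = [["text", "classification", "is", "useful"],
             ["deep", "learning", "improves", "text", "classification"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vector = model.wv["classification"]                          # a 100-dim dense vector
neighbors = model.wv.most_similar("classification", topn=2)  # semantic neighbors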
For performance evaluation, the F1 score combines precision and recall:
$F_1 = \dfrac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$  (3)
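A direct translation of Eq. (3) into code, as a sketch:

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, Eq. (3)."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. precision = 0.8, recall = 0.6  ->  F1 = 0.96 / 1.4 ≈ 0.686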
3.1 CNN-based
The convolutional neural network is a multi-layer neural network structure that has been widely applied and has changed daily life to some extent. For example, in the field of image recognition, Wang et al. [Wang, Qin, Xiang et al. (2019)] propose CAPTCHA recognition methods based on deep CNNs. Pan et al. [Pan, Qin, Chen et al. (2019)] propose a food recognition algorithm based on CNNs. Moreover, Pan et al. [Pan, Qin, Xiang et al. (2019)] also apply CNNs to agricultural products and propose a disease monitoring system for them. In the field of text classification, Kim proposes an effective text classification method by combining a CNN with natural language. He uses a CNN with one convolutional layer for text classification and compares different settings, such as random initialization, pretrained word embeddings, and static versus dynamic input matrices. Finally, he concludes that the static input matrix achieves the best classification effect. Kalchbrenner et al. [Kalchbrenner, Grefenstette and Blunsom (2014)] propose a similar model called the Dynamic Convolutional Neural Network (DCNN). Unlike the CNN proposed by Kim, the DCNN contains five convolutional layers and multiple dynamic k-max pooling layers. The k-max pooling operation extracts the k largest values from a series of convolutional filters, which makes sure the output length is fixed. Moreover, Johnson et al. [Johnson and Zhang (2014)] also propose a similar model, which uses up to six convolutional layers and three fully connected layers. Because the combination of CNN and RNN has achieved good results in the field of computer vision, Xiao et al. [Xiao and Cho (2016)] also combine an RNN and a CNN for sentence classification. Their model uses a convolutional network with up to five layers to learn high-level features, which are then used as the input to an LSTM.
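The k-max pooling operation described above can be sketched in a few lines of PyTorch (a generic implementation for illustration, not the authors' code); note that it preserves the original order of the selected features, as in the DCNN:

import torch

def k_max_pooling(x: torch.Tensor, k: int) -> torch.Tensor:
    # x: (batch, num_filters, seq_len) feature maps from a 1-D convolution.
    top_values, top_indices = x.topk(k, dim=-1)  # k largest values per filter
    order = top_indices.sort(dim=-1).values      # restore their original order
    return x.gather(-1, order)                   # fixed output: (batch, num_filters, k)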
In these earlier text classification works, CNNs use rather shallow architectures, with a convolutional depth of at most six layers. Since shallow CNNs can only extract local features within a limited window size, Conneau et al. [Conneau, Schwenk, Barrault et al. (2016)] propose a very deep CNN, with a convolutional depth of up to 29 layers, to extract hierarchical local features for text classification. The architecture achieves stable performance on eight freely available large-scale datasets, providing the first evidence that depth benefits convolutional neural networks in this setting. Similarly, Johnson et al. [Johnson and Zhang (2017)] propose the Deep Pyramid Convolutional Neural Network (DPCNN), which carefully studies the depth of word-level CNNs. This novel DPCNN structure can effectively extract features of long-range associations and obtain more global information. Fig. 2 shows the structure of the DPCNN model. Firstly, the model inputs a sentence, "A good buy!", into the text region embedding layer, which uses word embeddings to generate a vector representation for each word in the sentence. This is followed by a stack of two convolution blocks and a shortcut connection; the number of feature maps is fixed to 250 and the kernel size to 3. The shortcut connections, with pre-activation Wσ(x)+b and identity mapping, enable the training of deep networks. Downsampling can effectively represent more global information in the text; in this model, the stride of the downsampling is 2. The method uses unsupervised embeddings to train the text region embedding, improving accuracy and reducing training time.
Figure 2: The structure of the DPCNN model: a region embedding layer (optionally combined with unsupervised embeddings) encodes the input "A good buy!", followed by repeated blocks of two convolutional layers (kernel size 3, 250 feature maps, pre-activation Wσ(X)+b) with shortcut connections and stride-2 pooling for downsampling.
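Based on the description above, one repeatable DPCNN block might be sketched in PyTorch as follows (the layer names and the ReLU choice are assumptions for illustration, not the authors' released code):

import torch.nn as nn

class DPCNNBlock(nn.Module):
    """Stride-2 downsampling followed by two pre-activation convolutions
    (kernel size 3, 250 feature maps) with an identity shortcut."""
    def __init__(self, channels: int = 250):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.pool(x)                # downsampling with stride 2
        out = self.conv1(self.act(x))   # pre-activation: conv applied to sigma(x)
        out = self.conv2(self.act(out))
        return x + out                  # shortcut with identity mapping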
Yao et al. [Yao, Mao and Luo (2019)] propose a Graph Convolutional Network for text classification (text GCN), which builds a heterogeneous graph over the whole corpus. After building the text graph, the authors feed the graph into a simple two-layer GCN for feature extraction: the first layer uses the ReLU activation function and the second layer uses the softmax function. The propagation between layers is calculated by Eq. (5).
$Z = \mathrm{softmax}\big(\tilde{A}\,\mathrm{ReLU}(\tilde{A} X W_0)\, W_1\big)$  (5)
where $A$ is the adjacency matrix of the text graph, $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix, $X$ is the node feature matrix, and $W_0$ and $W_1$ are weight matrices. Compared with
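A plain numpy sketch of the forward pass in Eq. (5) (self-loops are assumed to be already added to A, following standard GCN practice):

import numpy as np

def text_gcn_forward(A, X, W0, W1):
    # A: (n, n) adjacency matrix of the text graph; X: (n, d) node features.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt             # normalized adjacency
    H = np.maximum(0.0, A_hat @ X @ W0)             # first layer: ReLU
    Z = A_hat @ H @ W1                              # second layer: logits
    Z -= Z.max(axis=1, keepdims=True)               # numerical stability
    return np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax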
the previous CNN-based models, the text GCN model achieves better classification results on multiple text classification benchmark datasets and shows better robustness. Although text GCN produces better text classification results, it cannot quickly generate embeddings for unseen documents. In the future, attention mechanisms could be used in GCNs to improve classification performance, and unsupervised text GCNs could be developed for representation learning on large-scale unlabeled text data.
3.2 RNN-based
At present, Recurrent Neural Networks (RNNs) have been widely used in machine translation, speech recognition, image description generation and other sequential data processing tasks. The bidirectional recurrent structure introduced into RNNs captures the interrelations within the input sequence in both directions. RNNs have great advantages when modeling text sequences and their long-term dependencies [Chung, Gulcehre, Cho et al. (2014)]. An influential model for text classification is the bidirectional recursive neural network (BRNN) proposed by Socher et al. [Socher, Pennington, Huang et al. (2011)]. The bidirectional structure assumes that the current output is related to both the preceding and the following information, and can therefore capture global long-term dependencies. RNNs have multiple variant models for text classification. The Long Short-Term Memory network (LSTM) is a variant of the RNN that can solve the long-term dependency problem: it updates the hidden state of each layer by removing or adding information to the cell state through "gate" structures. Tang et al. [Tang, Qin and Liu (2015)] propose gated recurrent network models to learn the semantics of sentences and their contextual relations. The model first learns sentence representations with a CNN or LSTM, and then uses a gated recurrent neural network to encode the semantics of sentences and their relations into a document representation.
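As an illustrative sketch (not Tang et al.'s exact architecture), a bidirectional LSTM sentence encoder with a classification head can be written in PyTorch as:

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):               # (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(x)                # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)     # concat forward/backward states
        return self.fc(h)                       # class logits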
Lai et al. [Lai, Xu, Liu et al. (2015)] design a more complex network structure. They propose the Recurrent Convolutional Neural Network (RCNN), which combines an RNN with a CNN and uses a bidirectional LSTM to obtain the context representation of each word. First, the left-side and right-side context semantics of every word are captured. The left-side context $c_l(w_i)$ of word $w_i$ is calculated by Eq. (6), where $W^{(sl)}$ is a matrix that combines the semantics of the current word into the left context of the next word, and $f$ is a non-linear activation function. In the same way, the right-side context $c_r(w_i)$ of word $w_i$ is calculated by Eq. (7). Then the model uses Eq. (8) to define $x_i$, the representation of word $w_i$, as the concatenation of $c_l(w_i)$, $e(w_i)$ and $c_r(w_i)$.
$c_l(w_i) = f\big(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})\big)$  (6)
$c_r(w_i) = f\big(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})\big)$  (7)
$x_i = \big[c_l(w_i);\, e(w_i);\, c_r(w_i)\big]$  (8)
Finally, the model uses max-pooling to extract the maximum feature values of these vectors in order to obtain information about the whole text. In this method, a new model is constructed flexibly by combining RNN and CNN, and the advantages of the two models are combined to improve the final text classification performance.
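A numpy sketch of Eqs. (6)-(8), with $f$ taken to be tanh and the dimension names chosen for illustration:

import numpy as np

def rcnn_word_representations(E, Wl, Wr, Wsl, Wsr):
    # E: (n, d) word embeddings e(w_i); Wl/Wr: (h, h); Wsl/Wsr: (h, d).
    n, _ = E.shape
    h = Wl.shape[0]
    cl = np.zeros((n, h))
    cr = np.zeros((n, h))
    for i in range(1, n):                       # Eq. (6): left contexts
        cl[i] = np.tanh(Wl @ cl[i - 1] + Wsl @ E[i - 1])
    for i in range(n - 2, -1, -1):              # Eq. (7): right contexts
        cr[i] = np.tanh(Wr @ cr[i + 1] + Wsr @ E[i + 1])
    return np.concatenate([cl, E, cr], axis=1)  # Eq. (8): x_i per word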
3.3 Attention Mechanism-based
Hierarchical Attention Networks (HANs) [Yang, Yang, Dyer et al. (2016)] apply attention at both the word level and the sentence level, which can alleviate the gradient disappearance problem when an RNN captures the sequence information of a document. However, HANs are much slower to train because they rely on RNNs. Gao et al. [Gao, Ramanathan and Tourassi (2018)] propose the Hierarchical Convolutional Attention Network (HCAN), a self-attention-based architecture that can capture semantic relationships over long sequences like an RNN while remaining fast like a CNN, achieving both speed and accuracy in the text classification task. Their experiments also show that self-attention-based models may replace RNN-based models to reduce training time without sacrificing accuracy.
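The self-attention operation at the heart of such architectures can be sketched generically (this is plain scaled dot-product attention, not HCAN's full hierarchical model):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n, d) word vectors; Wq/Wk/Wv: (d, dk) learned projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])         # all-pairs similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    # Each position mixes information from the whole sequence at once,
    # capturing long-range dependencies without recurrence.
    return weights @ V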
Taking the classification of news texts as an example, Cai et al. [Cai, Li, Wang et al. (2018)] use Sohu news data as the dataset, drawn from 18 channels, including domestic, international, sports, society and entertainment, between June 2012 and July 2012. First, the datasets are preprocessed; then multiple models are built on the training set; finally, the output categories of the test set are predicted and the evaluation metrics are calculated. The test results are shown in Tab. 1 [Cai, Li, Wang et al. (2018)].
Table 1: The classification results of the datasets
The text classification models based on deep learning in the above table are common methods. Compared with previous research results, their performance and efficiency are improved. Although there are many model variants, a good and suitable model depends not only on the type of project task but also on the type and size of the dataset.
4 Conclusion
This paper mainly presents text classification methods based on deep learning and several classic text classification network models. The analysis shows that training a good text classification model depends not only on the deep learning network model, but also on the training data. Moreover, network structures based on the attention mechanism can intuitively explain which information is valuable in the process of text classification, which is conducive to improving system performance. With the rapid development of deep learning, text classification methods based on deep learning will face more severe challenges. In future work, it is necessary to pay attention to universality, accuracy, training speed, prediction speed, interpretability and the difficulty of parameter tuning.
Funding Statement: This work was supported in part by the National Natural Science Foundation of China under Grant 61872134, in part by the Natural Science Foundation of Hunan Province under Grant 2018JJ2062, in part by the Science and Technology Development Center of the Ministry of Education under Grant 2019J01020, and in part by the 2011 Collaborative Innovative Center for Development and Utilization of Finance and Economics Big Data Property, Universities of Hunan Province.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report
regarding the present study.
References
Bian, J.; Gao, B.; Liu, T. Y. (2014): Knowledge-powered deep learning for word
embedding. Joint European Conference on Machine Learning and Knowledge Discovery
in Databases, pp. 132-148.
Cai, J.; Li, J.; Li, W.; Wang, J. (2018): Deep learning model used in text classification.
International Computer Conference on Wavelet Active Media Technology and
Information Processing, pp. 123-126.
Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. (2014): Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Conneau, A.; Schwenk, H.; Barrault, L.; Lecun, Y. (2016): Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.
Davis, C. A.; Fonseca, F. T. (2007): Assessing the certainty of locations produced by an
address geocoding system. Geoinformatica, vol. 11, no. 1, pp. 103-129.
Du, J.; Gui, L.; Xu, R.; He, Y. (2017): A convolutional attention model for text
classification. National CCF Conference on Natural Language Processing and Chinese
Computing, pp. 183-195.
Gai, R. L.; Gao, F.; Duan, L. M.; Sun, X. H.; Li, H. Z. (2014): Bidirectional maximal
matching word segmentation algorithm with rules. Advanced Materials Research, vol.
926, pp. 3368-3372.
Gao, S.; Ramanathan, A.; Tourassi, G. (2018): Hierarchical convolutional attention networks for text classification. Oak Ridge National Laboratory, Oak Ridge, TN.
Goldberg, Y.; Levy, O. (2014): Word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Gong, Z.; Yu, T. (2010): Chinese web text classification system model based on naive
Bayes. International Conference on E-Product E-Service & E-Entertainment, pp. 1-4.
Hinton, G. E. (1986): Learning distributed representations of concepts. Proceedings of
the Eighth Annual Conference of the Cognitive Science Society, vol. 1, pp. 1-12.
Hu, B.; Tang, B.; Chen, Q.; Kang, L. (2016): A novel word embedding learning model
using the dissociation between nouns and verbs. Neurocomputing, vol. 171, pp. 1108-1117.
Jeffery, P.; Richard, S.; Christopher, M. (2017): GloVe: global vectors for word representation. https://nlp.stanford.edu/projects/glove/.
Johnson, R.; Zhang, T. (2014): Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.
Johnson, D. E.; Oles, F. J.; Zhang, T.; Goetz, T. (2002): A decision-tree-based symbolic
rule induction system for text categorization. IBM Systems Journal, vol. 41, no. 3, pp. 428-437.
Johnson, R.; Zhang, T. (2017): Deep pyramid convolutional neural networks for text
categorization. Proceedings of the Annual Meeting of the Association for Computational
Linguistics, vol. 1, pp. 562-570.
Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. (2014): A convolutional neural network for modeling sentences. arXiv preprint arXiv:1404.2188.
Kesorn, K.; Poslad, S. (2012): An enhanced bag-of-visual word vector space model to
represent visual content in athletics images. IEEE Transactions on Multimedia, vol. 14,
no. 1, pp. 211-222.
Kim, Y. (2014): Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012): Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pp. 1097-1105.
Lai, S.; Xu, L.; Liu, K.; Zhao, J. (2015): Recurrent convolutional neural networks for
text classification. Twenty-ninth AAAI Conference on Artificial Intelligence.
Li, H.; Chen, P. H. (2014): Improved backtracking-forward algorithm for maximum matching
Chinese word segmentation. Applied Mechanics and Materials, vol. 536, pp. 403-406.
Li, Z.; Shang, W.; Yan, M. (2016): News text classification model based on topic model.
IEEE/ACIS International Conference on Computer & Information Science, pp. 1-5.
Liu, P.; Qiu, X.; Huang, X. (2016): Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.
Ma, Q.; Yu, L.; Tian, S.; Chen, E.; Ng, W. W. (2019): Global-local mutual attention
model for text classification. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 27, no. 12, pp. 2127-2139.
Miao, F.; Zhang, P.; Jin, L.; Wu, H. (2018): Chinese news text classification based on
machine learning algorithm. Proceedings of International Conference on Intelligent
Human-Machine Systems and Cybernetics, vol. 2, pp. 48-51.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; Dean, J. (2013): Distributed
representations of words and phrases and their compositionality. Advances in Neural
Information Processing Systems, pp. 3111-3119.
Pan, L.; Qin, J.; Chen, H.; Xiang, X.; Li, C. et al. (2019): Image augmentation-based
food recognition with convolutional neural networks. Computers Materials & Continua,
vol. 59, no. 1, pp. 297-313.
Pan, W.; Qin, J.; Xiang, X.; Wu, Y.; Tan, Y. et al. (2019): A smart mobile diagnosis
system for citrus diseases based on densely connected convolutional networks. IEEE
Access, vol. 7, pp. 87534-87542.
Rare, T. (2017): Gensim: topic modeling for humans.
https://radimrehurek.com/gensim/index.html.
Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. (2012): Deep neural networks for
acoustic modeling in speech recognition. IEEE Signal Processing Magazine.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D. et al. (2013): Recursive
deep models for semantic compositionality over a sentiment treebank. Proceedings of the
2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642.
Socher, R.; Pennington, J.; Huang, E. H.; Ng, A. Y.; Manning, C. D. (2011):
Semi-supervised recursive autoencoders for predicting sentiment distributions. Proceedings
of the Conference on Empirical Methods in Natural Language Processing, pp. 151-161.
Soucy, P.; Mineau, G. W. (2005): Beyond TFIDF weighting for text categorization in
the vector space model. IJCAI, vol. 5, pp. 1130-1135.
Tang, D.; Qin, B.; Wei, F.; Dong, L.; Liu, T. et al. (2015): A joint segmentation and
classification framework for sentence level sentiment classification. IEEE/ACM
Transactions on Audio, Speech and Language Processing, vol. 23, no. 11, pp. 1750-1761.
Tang, D.; Qin, B.; Liu, T. (2015): Document modeling with gated recurrent neural
network for sentiment classification. Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pp. 1422-1432.
Tomas, M. (2017): Word2Vec. https://code.google.com/archive/p/word2vec/.
Wang, J.; Qin, J. H.; Xiang, X. Y.; Tan, Y.; Pan, N. (2019): CAPTCHA recognition
based on deep convolutional neural network. Mathematical Biosciences and Engineering, vol. 16, no. 5, pp. 5851-5861.
Wang, S.; Huang, M.; Deng, Z. (2018): Densely connected CNN with multi-scale
feature attention for text classification. IJCAI, pp. 4468-4474.
Wei, S.; Guo, J.; Yu, Z.; Chen, P.; Xian, Y. (2013): The instructional design of Chinese
text classification based on SVM. Chinese Control and Decision Conference, pp. 5114-5117.
Wu, H. C.; Luk, R. W. P.; Wong, K. F.; Kwok, K. L. (2008): Interpreting TF-IDF term
weights as making relevance decisions. ACM Transactions on Information Systems, vol.
26, no. 3, pp. 1-37.
Xiao, Y.; Cho, K. (2016): Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.
Xu, Q. N.; Liu, Z. (2008): Automatic Chinese text classification based on
NSVMDT-KNN. IEEE International Conference on Fuzzy Systems & Knowledge
Discovery, vol. 2, pp. 410-414.
Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A. et al. (2016): Hierarchical attention
networks for document classification. Proceedings of the Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pp. 1480-1489.
Yao, L.; Mao, C.; Luo, Y. (2019): Graph convolutional networks for text classification.
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7370-7377.
Zhang, L.; Li, Y.; Meng, J. (2006): Design of Chinese word segmentation system based
on improved Chinese converse dictionary and reverse maximum matching algorithm.
International Conference on Web Information Systems Engineering, pp. 171-181.
Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B. et al. (2016): Attention-based bidirectional
long short-term memory networks for relation classification. Proceedings of the Annual
Meeting of the Association for Computational Linguistics, vol. 2, pp. 207-212.