
A Review of Text Classification Based on Deep Learning

Yifan Zhou
Wuhan University, China
[email protected]

ABSTRACT
Text classification is the process of assigning a piece of text to one or more predetermined classes. Text categorization has important applications in redundancy filtering, organization management, information retrieval, index building, ambiguity resolution, and text filtering. This paper introduces the research background of text classification and surveys research trends in text classification at home and abroad. Text classification is an essential component of many NLP problems, and neural network models have achieved extraordinary results on it. We therefore discuss how common deep learning methods deal with text classification, including the Convolutional Neural Network (CNN), the Recurrent Convolutional Neural Network (RCNN), Long Short-Term Memory (LSTM), and fastText. A CNN constructs the representation of text through convolution operations, an RNN does well in capturing contextual information, and LSTM is explicitly designed for time-series data and learning long-term dependencies. In addition, we introduce distributed representations such as Continuous Bag of Words (CBOW) and Skip-Gram, and analyze the advantages of the word2vec model over one-hot encoding.

CCS Concepts
• Information systems → Database management system engines • Computing methodologies

Keywords
Text classification; Word2Vec; CNN; RCNN; LSTM; fastText

1. INTRODUCTION
Text classification is a basic task in the field of natural language processing, with many application scenarios such as news category classification on news websites, sentiment analysis, and information retrieval. There are two mainstream approaches to text classification in natural language processing: traditional machine learning methods and deep learning methods. Feature extraction is the key technical point of traditional machine learning methods; the usual techniques are features based on the bag-of-words model, pLSA, and LDA. However, the text representations formed by these feature extraction methods are high-dimensional and sparse, and their expressive power is limited. At the same time, traditional feature extraction ignores word order and context information, and it separates the feature extraction process from the model design process.

In contrast to traditional machine learning methods, deep learning uses an end-to-end model: text features are extracted automatically through a neural network, that is, learned by the model itself, and are then fed into the later layers of the model for training. The deep learning model expresses text as continuous dense vectors, solving the problem of text representation, and then automatically acquires feature expressiveness through network structures such as CNNs and RNNs. Traditional machine learning text classification involves text preprocessing, feature extraction, and model building as separate steps, whereas deep learning text classification forms an end-to-end structure covering feature extraction, text representation, and model construction.

2. WORD VECTOR REPRESENTATION
A deep learning model cannot accept raw text as input; it can only handle numeric tensors. The process of converting text data into numeric tensors is called text vectorization. There are two main methods of text vectorization: one-hot encoding and word embeddings.

One-hot encoding associates each word with a unique integer index i and converts that index into a binary vector V of length N (where N is the vocabulary size) whose i-th element is 1 and whose remaining elements are all 0. The vectors obtained by one-hot encoding are high-dimensional and sparse: the dimensionality equals the number of words in the vocabulary, and most elements are zero. Although this makes text easy to process as vectors, the relationships between words are lost, and the syntactic and semantic similarity between words cannot be effectively represented. Therefore, the second method, word embedding, is used in this work.

A word embedding W: words → R^n is a parameterized function that maps words to real-valued vectors, e.g. W("cat") = (0.2, −0.4, 0.7, …) and W("mat") = (0.0, 0.6, −0.1, …). Although the word embedding model also maps words into a vector space, for a vocabulary on the order of thousands of words the embedding dimensionality is still much smaller than the one-hot dimensionality. At the same time, the vectors obtained by word embedding are continuous floating-point vectors, and words with similar meanings are mapped to nearby positions in the vector space, so more information can be packed into a low-dimensional space. Moreover, unlike one-hot encoding, word embeddings are learned from data.
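To make the contrast concrete, here is a minimal Python sketch (using NumPy) of both representations for a toy vocabulary; the vocabulary, the embedding dimension, and the randomly initialized embedding matrix are illustrative assumptions rather than details from the paper.

```python
import numpy as np

# Toy vocabulary (illustrative assumption, not from the paper).
vocab = ["cat", "sat", "on", "the", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}
N = len(vocab)   # vocabulary size -> one-hot dimensionality
n = 3            # embedding dimensionality (n << N in practice)

def one_hot(word):
    """Length-N binary vector whose i-th element is 1 for word index i."""
    v = np.zeros(N)
    v[word_to_index[word]] = 1.0
    return v

# A word embedding is just an N x n matrix of parameters; it is random
# here for illustration, whereas in practice it is learned from data
# (e.g. with word2vec or jointly with the classifier).
embedding_matrix = np.random.uniform(-0.5, 0.5, size=(N, n))

def embed(word):
    """Dense n-dimensional vector: a row lookup in the embedding matrix."""
    return embedding_matrix[word_to_index[word]]

print(one_hot("cat"))   # sparse, high-dimensional, e.g. [1. 0. 0. 0. 0.]
print(embed("cat"))     # dense, low-dimensional, e.g. [ 0.21 -0.43  0.07]
```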

Figure 1. One-hot word vectors (sparse, high-dimensional) vs. word embeddings (dense, low-dimensional).

There are two main methods for obtaining word embeddings, one static and the other non-static:

(1) Static mode. Another model is first used to pre-train the word embeddings, and the trained embeddings are then plugged into the classification model. During training, the model does not update the word vectors, which is a form of transfer learning. The static mode is suitable for situations where the amount of data is small.

(2) Non-static mode. The word embeddings are learned while the text classification task itself is being trained. For example, random word vectors are used at the beginning and are then learned in the same way as the other weights of the network. Another option is the fine-tuning variant of the non-static mode, which initializes the word vectors with pre-trained word2vec vectors and adjusts them during training to accelerate convergence. Directly randomizing the word vectors works well if there are sufficient training data and resources.

The Word2Vec model infers the vector of each word from its context. Using maximum likelihood, it maximizes the probability of the target word w given the preceding context h. If two words can replace each other in the same contexts, then the distance between their vectors is very small. The Word2Vec model is trained on every sentence in the dataset by sliding a fixed-size window over the sentence and predicting the vector of the word in the middle of the window from its surrounding context. The output of the model is called an embedding matrix.

Word2Vec has two modes: CBOW (Continuous Bag of Words) and Skip-Gram. CBOW infers the target word from its surrounding context, while Skip-Gram, conversely, predicts the surrounding context from the target word. CBOW is typically used when the amount of data is small, whereas Skip-Gram works well on large corpora.

Figure 2. Word2Vec Model.
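As an illustration of the two modes, the sketch below trains CBOW and Skip-Gram models with the gensim library (assuming gensim 4.x); the toy corpus and the hyperparameter values are arbitrary choices for demonstration, not settings from the paper.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 selects CBOW (predict the centre word from its window),
# sg=1 selects Skip-Gram (predict the window from the centre word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Each trained model exposes the learned embedding matrix through .wv;
# rows can be read back as dense word vectors.
print(cbow.wv["cat"].shape)              # (50,)
print(skipgram.wv.most_similar("cat"))   # nearest neighbours in vector space
```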


3. DEEP LEARNING MODEL FOR TEXT CLASSIFICATION
In this section we show how to build models for text categorization from vectorized text, mainly introducing the application of the CNN, RCNN, C-LSTM, and fastText deep learning models to text classification.

3.1 CNN
The paper [1] proposes that the core idea of a CNN model for sentence-level classification is to use local features: the CNN extracts different local features through different convolution kernels. The figure below shows the structure of a CNN for text classification, which mainly consists of an input layer, a convolution layer, a pooling layer, and a fully connected layer. x_i denotes the i-th word of the sentence, mapped to a k-dimensional word vector. A sentence of length n can then be expressed as x_{1:n} = x_1 ⊕ x_2 ⊕ … ⊕ x_n, where ⊕ is the concatenation operator. The convolution operation involves a filter w ∈ R^{hk}, which is applied to windows of h words to produce new features. For example, the feature c_i is generated from the window of words x_{i:i+h−1} as c_i = f(w · x_{i:i+h−1} + b), where b is a bias term and f is a nonlinear function. Applying the filter to every window of size h yields a feature map. A max-pooling operation then extracts the maximum value of the feature map as the feature for that filter; taking the maximum automatically determines which words play a key role in the classification. Besides, max-pooling reduces the number of parameters and computations and helps prevent over-fitting.

Figure 3. TextCNN Model
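Below is a minimal sketch of this TextCNN architecture in Keras (tf.keras), with three parallel filter sizes, max-pooling over time, and a softmax output; the vocabulary size, embedding dimension, sequence length, filter counts, and number of classes are placeholder assumptions, not the configuration used in [1].

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder hyperparameters (illustrative assumptions).
vocab_size, embed_dim, seq_len, num_classes = 20000, 128, 100, 4

inputs = layers.Input(shape=(seq_len,), dtype="int32")

# Static vs. non-static word vectors: pass weights=[pretrained_matrix]
# and trainable=False for the static mode, or trainable=True to fine-tune.
x = layers.Embedding(vocab_size, embed_dim)(inputs)

# One branch per filter size h: convolution over windows of h words,
# then max-pooling over time keeps the strongest feature per filter.
pooled = []
for h in (3, 4, 5):
    c = layers.Conv1D(filters=100, kernel_size=h, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(c))

concat = layers.Concatenate()(pooled)
outputs = layers.Dense(num_classes, activation="softmax")(concat)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```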
In the CNN model the features are the word vectors, which can again be static or non-static. The static method uses pre-trained word vectors such as word2vec and does not update them during training; it can be regarded as a form of transfer learning and tends to work well especially when the amount of data is not large. The non-static method updates the word vectors during training; the recommended variant is fine-tuning, which initializes the word vectors with pre-trained word2vec vectors and adjusts them during training to accelerate convergence. Of course, if there are sufficient training data and resources, direct random initialization of the word vectors can also work well.

The process above uses one filter to extract one kind of feature; in practice multiple different filters are used to extract different kinds of features. These features form the next layer, and a fully connected softmax layer finally outputs a probability distribution over the category labels.

3.2 RCNN
One of the biggest problems with CNN is its fixed filter size: on the one hand, longer sequence information cannot be modeled; on the other hand, tuning the filter size as a hyper-parameter is cumbersome. The essence of the CNN here is feature extraction from text, and a model more commonly used in natural language processing is the Recurrent Convolutional Neural Network (RCNN), which expresses contextual information better. Specifically, in the text classification task a bidirectional RNN can be understood, in a sense, as capturing variable-length and bidirectional n-gram information. The design of the RCNN for classification problems is introduced in [2]; the following figure shows a schematic diagram of the network structure. In the simplest example, the output for the last word is connected directly to a fully connected softmax layer.

In [2], a method for capturing the semantics of text using an RNN is proposed. The model is a bidirectional recurrent neural network whose input is a document D consisting of the word sequence w_1, w_2, … and whose output is a class label; p(k | D, θ) denotes the probability of the document belonging to class k, where θ are the parameters of the network. Each word is represented by itself together with its context, and with the help of the context a more precise word representation can be obtained. In this model, c_l(w_i) is the left context of word w_i and c_r(w_i) is its right context, with c_l(w_i) = f(W^{(l)} c_l(w_{i−1}) + W^{(sl)} e(w_{i−1})), where W^{(l)} is a matrix that transforms the previous hidden context into the next hidden layer, W^{(sl)} is a matrix that combines the semantics of the current word with the left context of the next word, and f is a non-linear activation function; the right context c_r(w_i) is computed symmetrically from w_{i+1}. In Equation (3) the representation of word w_i is defined as the concatenation of the left-side context vector c_l(w_i), the word embedding e(w_i), and the right-side context vector c_r(w_i):

x_i = [c_l(w_i); e(w_i); c_r(w_i)].    (3)

Figure 4. RCNN Model

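As a rough sketch of this architecture, the following Keras model uses a bidirectional LSTM as the recurrent context encoder (a common simplification of the exact recurrence in [2]), concatenates the contexts with the word embeddings to form x_i, and adds the tanh transformation, max-pooling, and softmax output layers described in the next paragraph; all sizes are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder sizes (illustrative assumptions).
vocab_size, embed_dim, seq_len, hidden, num_classes = 20000, 128, 100, 100, 4

inputs = layers.Input(shape=(seq_len,), dtype="int32")
e = layers.Embedding(vocab_size, embed_dim)(inputs)           # e(w_i)

# Bidirectional recurrence supplies the left/right context vectors c_l, c_r.
contexts = layers.Bidirectional(
    layers.LSTM(hidden, return_sequences=True))(e)

# x_i = [c_l(w_i); e(w_i); c_r(w_i)], followed by a tanh transformation y_i^(2).
x = layers.Concatenate()([contexts, e])
y2 = layers.Dense(hidden, activation="tanh")(x)

# Element-wise max over time (the max-pooling layer y^(3)), then softmax output.
y3 = layers.GlobalMaxPooling1D()(y2)
outputs = layers.Dense(num_classes, activation="softmax")(y3)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```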
After that, we obtain the representation x_i of the word w_i. First a linear transformation is applied, then a tanh activation, and the result y_i^{(2)} is sent to the next layer. When the representations of all words have been calculated, a max-pooling layer is applied; the max function is element-wise, so the k-th element of y^{(3)} is the maximum of the k-th elements of the y_i^{(2)}. The pooling layer converts texts of various lengths into a fixed-length vector, allowing information from the entire text to be captured. The last part of the model is the output layer; similar to traditional neural networks, it is defined as y^{(4)} = W^{(4)} y^{(3)} + b^{(4)}, and finally the softmax function is applied to y^{(4)}.

3.3 C-LSTM
In [3], the authors propose a combined CNN+LSTM model (C-LSTM) for sentence representation and text classification. C-LSTM uses a CNN to extract higher-level phrase representations of a sentence and then feeds these phrase representations into an LSTM layer to obtain the sentence representation, so it can learn both the local features of phrases and the semantics of the whole sentence.

The structure of the C-LSTM model is shown in the figure below. Blocks of the same color in the feature map layer and the window feature sequence layer correspond to features for the same window, and the dashed lines connect the feature of a window with its source feature map. The input text is convolved with filters to form feature maps; squares of the same color in the feature maps are mapped to the same position in the window feature sequence layer. Max-pooling selects the most important features in the window feature sequence layer, which are then fed into the LSTM units. The hidden state of the last time step of the LSTM serves as the text representation, and a softmax layer is added at the end, so the final output of the entire model is produced from the last hidden unit of the LSTM. The model is trained by minimizing the cross-entropy loss, with the parameters learned by stochastic gradient descent using the RMSprop update rule.

Figure 5. C-LSTM Model

LSTM is designed for sequence data, and pooling would break the sequence organization because the selected features are discontinuous. The LSTM architecture has a chain of repeated modules, one per time step, as in a standard RNN. Since LSTM is explicitly designed for time-series data and for learning long-term dependencies, it is placed on top of the convolution layer to learn such dependencies in the sequence of higher-level features.

The core idea of adding attention on top of CNN and RNN is that a text should be weighted differently depending on the context to which it belongs, which is obviously more reasonable. The attention mechanism reconsiders the importance of each sentence, improving on models in which the RNN or LSTM only uses its hidden variables. Although CNN and RNN achieve significant results in text categorization tasks, they share the deficiency that the model cannot be explained well. The attention mechanism is a commonly used way of modeling long-term memory in natural language processing; it can intuitively show the contribution of each word to the result and has basically become standard in Seq2Seq models. In fact, text categorization can also be understood as a special kind of Seq2Seq task in a certain sense, so it is natural to consider introducing the attention mechanism into these models.

3.4 fastText
fastText is a fast text classification algorithm that often reaches metrics similar to deep networks while being many orders of magnitude faster to train. Using a linear model with a rank constraint, it can be trained on one billion words in ten minutes while achieving performance comparable to the state of the art. The model structure is very similar to CBOW in word2vec, and both are classification models, but fastText predicts a document's category label whereas CBOW predicts the middle word. The fastText model has only one hidden layer and one output layer, as shown below.

Figure 6. fastText Model

The input of fastText is a sentence represented by N-gram features x_1, x_2, …, x_n, which are embedded and averaged to form the hidden variable, i.e. the feature representation of the text. The hidden layer of fastText is thus the average of the input embeddings, obtained through the weight matrix A, which is a look-up table over the words; this is equivalent to weighting and summing the individual word vectors to obtain the sentence vector. The output layer is obtained by multiplying the hidden layer by another weight matrix B. When there are many classes the computation becomes expensive, with complexity O(kh), where k is the number of classes and h is the dimension of the text representation. fastText therefore uses a hierarchical softmax based on a Huffman coding tree to reduce the complexity to O(h log2(k)); each node carries the probability of the path from the root to that node.

In order to take word order into account, fastText uses N-gram features. The N-gram input vectors are randomly initialized, and since there are far more N-grams than words, they cannot all be stored explicitly; fastText uses the hashing trick to map all n-grams into buckets, so that n-grams falling into the same bucket share an embedding vector. fastText is a simple baseline method for text classification; unlike the unsupervised word vectors of word2vec, the word features of the fastText model can be averaged together to form a good sentence representation.

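A classifier of this kind can be trained with the open-source fastText Python package, as in the sketch below; the file names, label format, and hyperparameter values are placeholders for illustration, not settings reported in the paper.

```python
import fasttext

# train.txt holds one example per line in the form:
#   __label__sports some tokenized text ...
# (file names and labels are placeholders for this sketch).
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,            # learning rate
    epoch=10,          # passes over the training data
    wordNgrams=2,      # bag of bigrams to capture some word order
    loss="hs",         # hierarchical softmax: O(h log2(k)) instead of O(kh)
)

# Predict the most probable label (and its probability) for a new text.
labels, probs = model.predict("the match ended in a late winning goal")
print(labels, probs)

# Evaluate precision/recall at 1 on a held-out file (placeholder name).
print(model.test("valid.txt"))
```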
Compared with neural-network-based classification algorithms, fastText has three major advantages. 1) Training speed: on a standard multicore CPU, fastText can be trained on more than one billion words in less than ten minutes and can classify 500,000 sentences among 312K classes in less than a minute. 2) fastText does not require pre-trained word vectors; it trains its own. 3) fastText has two important optimizations, hierarchical softmax and N-grams; using hierarchical softmax instead of a full softmax, combined with Huffman coding, reduces the complexity to the logarithmic level.

4. CONCLUSION AND FUTURE WORK
Deep neural network models such as the CNN were first successful in the field of imaging, and their strength is capturing local correlation. In text classification tasks, a CNN can automatically extract key information similar to n-grams in sentences. An RNN can capture contextual information as far as possible when learning word representations with a recurrent structure, and the representation of the text can then be constructed with a convolutional neural network. LSTM is able to learn phrase-level features through a convolutional layer and to learn long-term dependencies over the sequences of higher-level representations. Besides, with the development of language models such as BERT and ERNIE, results better than those of many deep neural network models have been achieved, so in future work we will pay more attention to solving text classification with language models such as BERT and ERNIE.

5. REFERENCES
[1] Kim Y. Convolutional Neural Networks for Sentence Classification[J]. Eprint Arxiv, 2014.
[2] Lai S, Xu L, Liu K, Zhao J. Recurrent Convolutional Neural Networks for Text Classification[J]. AAAI Conference on Artificial Intelligence, 2013.
[3] Zhou C, Sun C, Liu Z, et al. A C-LSTM Neural Network for Text Classification[J]. Computer Science, 2015, 1(4): 39-44.
[4] Joulin A, Grave E, Bojanowski P, et al. Bag of Tricks for Efficient Text Classification[J]. 2016.
[5] Aggarwal C C, Zhai C. A Survey of Text Classification Algorithms[J]. 2012.
[6] Duchi J, Hazan E, Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization[J]. Journal of Machine Learning Research, 2011, 12: 2121-2159.
[7] Dong L, Wei F, Liu S, Zhou M, Xu K. A Statistical Parsing Framework for Sentiment Classification[J]. CoRR, abs/1401.6330, 2014.
[8] Graves A, Mohamed A, Hinton G. Speech Recognition with Deep Recurrent Neural Networks[J]. Proceedings of ICASSP, 2013.
[9] Bengio Y, Courville A, Vincent P. Representation Learning: A Review and New Perspectives[J]. IEEE TPAMI, 2013.
[10] Zhang Y, Wallace B. A Sensitivity Analysis of Convolutional Neural Networks for Sentence Classification[J]. Computer Science, 2015.
