Figure 1. Word vectors: one-hot representation (sparse, high-dimensional) vs. word embedding (dense, low-dimensional)

Through research, there are two main methods for obtaining word embeddings, one static and the other non-static:

(1) Static mode. First, another machine learning model is used to pre-train the word embeddings, and the trained embeddings are then fed into the sentiment analysis model. During training, the model does not update the word vectors, so this is a kind of transfer learning. Static mode is suitable for situations where the amount of data is small.

(2) Non-static mode. The word embeddings are learned while completing the text classification task. For example, random word vectors are used at the beginning and are then learned together with the other weights of the neural network. Another way is the fine-tuning method in non-static mode, which initializes the word vectors with pre-trained word2vec vectors and adjusts them during training to accelerate convergence. Directly randomly initializing the word vectors also works well if there are sufficient training data and resources.

The Word2Vec model infers the vector of each word from its surrounding context. Using the maximum likelihood method, the probability of the target word w is maximized given the preceding context h. If two words can replace each other in context, then the distance between their vectors is very small. The Word2Vec model trains on each sentence in the dataset, sliding a fixed-size window over the sentence and predicting the word in the middle of the window from the surrounding context. The output of the model is called the embedding matrix.

Word2Vec has two modes: CBOW (Continuous Bag of Words) and Skip-Gram. CBOW infers the target word from its surrounding context, whereas Skip-Gram predicts the surrounding context from the target word. CBOW is used when the amount of data is small, while Skip-Gram works well on large corpora.
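As a concrete illustration, the following minimal sketch, assuming the gensim library (version 4.x, which the paper does not mention), trains word vectors in both modes by toggling the sg flag; the toy corpus and hyper-parameter values are purely illustrative.

# A minimal sketch of CBOW vs. Skip-Gram training with gensim (assumed tooling);
# the corpus and parameters are placeholders, not taken from the paper.
from gensim.models import Word2Vec

corpus = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "terrible"],
]

# sg=0 -> CBOW: predict the centre word from its context window.
cbow = Word2Vec(corpus, vector_size=100, window=5, sg=0, min_count=1)

# sg=1 -> Skip-Gram: predict the context words from the centre word.
skipgram = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)

# The learned embedding matrix (one dense vector per vocabulary word).
embedding_matrix = cbow.wv.vectors
print(embedding_matrix.shape)         # (vocab_size, 100)
print(cbow.wv.most_similar("movie"))  # nearest words in the embedding space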
Figure 3. TextCNN Model
In the CNN model, the features are the word vectors, which can be static or non-static. The static method uses word vectors pre-trained with word2vec, and the training process does not update them; we can regard this as a kind of transfer learning. Especially when the amount of data is not very large, static word vectors tend to work well. The non-static method updates the word vectors during training. The recommended approach is the fine-tuning method in non-static mode: it initializes the word vectors with pre-trained word2vec vectors, and adjusting them during training can accelerate convergence. Of course, if there are sufficient training data and resources, direct random initialization of the word vectors also works.

The process above uses one filter to extract one kind of feature; multiple different filters are then used to extract different kinds of features. These features together form the next layer, which is fed into a fully connected softmax layer that outputs a probability distribution over the category labels.
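For concreteness, the following is a minimal PyTorch sketch of such a TextCNN (an assumed re-implementation, not code from [1] or from this paper); the static flag switches between the static and non-static embedding modes, and the vocabulary size, filter sizes and randomly generated "pre-trained" vectors are placeholders.

# A minimal TextCNN sketch: static / non-static embeddings, multiple filter
# sizes, max-pooling over time, and a fully connected softmax output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, pretrained, num_classes, filter_sizes=(3, 4, 5),
                 num_filters=100, static=True):
        super().__init__()
        # static=True freezes the pre-trained vectors (transfer learning);
        # static=False fine-tunes them during training (non-static mode).
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=static)
        embed_dim = pretrained.size(1)
        # One 1-D convolution per filter size extracts one kind of n-gram feature.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-pool each feature map over time, then concatenate all feature kinds.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        logits = self.fc(torch.cat(feats, dim=1))
        return F.log_softmax(logits, dim=1)            # log class probabilities

# Example with random "pre-trained" vectors for a 1000-word vocabulary.
vectors = torch.randn(1000, 300)
model = TextCNN(vectors, num_classes=2, static=False)  # non-static mode
out = model(torch.randint(0, 1000, (8, 40)))           # 8 sentences of 40 tokens
print(out.shape)                                        # torch.Size([8, 2])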
3.2 RCNN
One of the biggest problems with CNN is the fixed view of filter_size. On the one hand, it cannot model longer sequence information; on the other hand, tuning the filter_size hyper-parameter is cumbersome. The essence of CNN here is to perform feature extraction for text, and a model more commonly used in natural language processing is the Recurrent Convolutional Neural Network (RCNN), which can better express contextual information. Specifically, in the text classification task, the bidirectional RNN can in a sense be understood as capturing variable-length and bidirectional n-gram information. In [2], the design of the RCNN for the classification problem is introduced. The following figure shows a schematic diagram of the network structure. The example uses the output of the last word and connects it directly to a fully connected softmax output layer.

In [2], a method for capturing the semantics of text using an RNN is proposed. The following figure shows the network structure of the model, which is a bidirectional recurrent neural network. The input to the model is a document $D$ consisting of the word sequence $w_1, w_2, \ldots$, and the output of the model contains the class elements. We use $p(k \mid D, \theta)$ to denote the probability of the document belonging to class $k$, where $\theta$ denotes the parameters of the network. A word is represented by itself and its context; with the help of context, we can obtain more precise word information. In this model, $c_l(w_i)$ is the left context of word $w_i$ and $c_r(w_i)$ is the right context of word $w_i$, with $c_l(w_i) = f(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1}))$ (the right context is computed symmetrically from the right-hand side). Here $W^{(l)}$ is a matrix that transforms the hidden-layer context into the next hidden layer, $W^{(sl)}$ is a matrix that combines the semantics of the current word with the next word's left context, and $f$ is a non-linear activation function. In Equation (3), we define the representation of word $w_i$ as the concatenation of the left-side context vector $c_l(w_i)$, the word embedding $e(w_i)$ and the right-side context vector $c_r(w_i)$:
$x_i = [c_l(w_i); e(w_i); c_r(w_i)]$ (3)

Figure 4. RCNN Model
After that, we obtain the representation $x_i$ of the word $w_i$. First, we apply a linear transformation; second, we apply the tanh activation function and send the result $y_i^{(2)} = \tanh(W^{(2)} x_i + b^{(2)})$ to the next layer. When the representations of all words have been calculated, we use a max-pooling layer. The max function is an element-wise function: the $k$-th element of $y^{(3)}$ is the maximum of the $k$-th elements of the $y_i^{(2)}$. The pooling layer converts texts of various lengths into a fixed-length vector, and with it we can capture information from the entire text. The last part of the model is an output layer; similar to traditional neural networks, it is defined as $y^{(4)} = W^{(4)} y^{(3)} + b^{(4)}$. Finally, the softmax function is applied to $y^{(4)}$.
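Putting these pieces together, the sketch below is an assumed PyTorch re-implementation of the RCNN of [2] (not the authors' code): a bidirectional RNN approximates the left and right context vectors, which are concatenated with the word embedding as in Equation (3), followed by the tanh projection, element-wise max-pooling over time, and the linear-plus-softmax output layer. All sizes are illustrative.

# A compact RCNN sketch: contexts from a bidirectional RNN, Equation (3)
# concatenation, tanh projection, max-pooling, and softmax output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, context_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Forward states stand in for c_l(w_i), backward states for c_r(w_i).
        # (In [2], c_l(w_i) uses only words strictly to the left of w_i; the
        # forward RNN state is used here as a close approximation.)
        self.rnn = nn.RNN(embed_dim, context_dim, batch_first=True,
                          bidirectional=True, nonlinearity='tanh')
        # y_i^(2) = tanh(W^(2) x_i + b^(2)), with x_i = [c_l(w_i); e(w_i); c_r(w_i)]
        self.proj = nn.Linear(2 * context_dim + embed_dim, hidden_dim)
        # y^(4) = W^(4) y^(3) + b^(4)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                   # (batch, seq_len)
        e = self.embedding(token_ids)               # (batch, seq_len, embed_dim)
        contexts, _ = self.rnn(e)                   # (batch, seq_len, 2*context_dim)
        c_left, c_right = contexts.chunk(2, dim=2)
        x = torch.cat([c_left, e, c_right], dim=2)  # Equation (3)
        y2 = torch.tanh(self.proj(x))               # (batch, seq_len, hidden_dim)
        y3 = y2.max(dim=1).values                   # element-wise max over time
        return F.softmax(self.out(y3), dim=1)       # p(k | D, theta)

model = RCNN(vocab_size=1000, embed_dim=100, context_dim=50,
             hidden_dim=64, num_classes=4)
probs = model(torch.randint(0, 1000, (8, 30)))      # 8 documents of 30 tokens
print(probs.shape)                                   # torch.Size([8, 4])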
… belongs. This kind of consideration is obviously more reasonable. The attention mechanism is used to reconsider the importance of each sentence, improving on models in which the RNN or LSTM only uses its hidden variables. Although CNN and RNN achieve significant results in text categorization tasks, they have the deficiency that the model cannot be well explained. The attention mechanism is a commonly used mechanism for modeling long-term memory in the field of natural language processing. It can intuitively give the contribution of each word to the result, and it has basically become standard in Seq2Seq models. In fact, text categorization can also be understood as a special kind of Seq2Seq in a certain sense, so it is natural to consider introducing the attention mechanism into text classification.
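As a minimal illustration of this idea, the sketch below (a generic attention-pooling layer, assumed for illustration rather than taken from the paper) computes a softmax weight for every hidden state; the weights both pool the sequence into a fixed-length vector and serve as interpretable per-word contribution scores.

# Attention pooling over RNN/LSTM hidden states: the softmax weights are the
# per-word importance scores that make the prediction easier to interpret.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # A learned query vector scores every hidden state.
        self.query = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden_states):                  # (batch, seq_len, hidden_dim)
        scores = self.query(hidden_states).squeeze(-1)  # (batch, seq_len)
        weights = F.softmax(scores, dim=1)              # per-word contributions
        pooled = torch.bmm(weights.unsqueeze(1), hidden_states).squeeze(1)
        return pooled, weights                          # text vector + weights

attn = AttentionPooling(hidden_dim=128)
h = torch.randn(4, 20, 128)               # e.g. LSTM outputs for 4 texts of 20 tokens
doc_vec, word_weights = attn(h)
print(doc_vec.shape, word_weights.shape)  # torch.Size([4, 128]) torch.Size([4, 20])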
Compared with neural network-based classification algorithms, fastText has three major advantages: 1) fastText is fast to train: on a standard multicore CPU it can be trained on more than one billion words in less than ten minutes, and it can classify 500,000 sentences among 312K classes in less than a minute. 2) fastText does not require pre-trained word vectors; it trains its own. 3) fastText has two important optimizations: hierarchical softmax and N-grams. Hierarchical softmax, combined with Huffman coding, is used instead of the standard softmax to reduce the complexity to a logarithmic level.
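A brief sketch of supervised training with the fastText Python package follows (an assumed tooling choice; the file name and hyper-parameter values are illustrative): loss="hs" selects the hierarchical softmax and wordNgrams=2 adds bigram features, the two optimizations mentioned above.

# fastText supervised classification sketch; train.txt holds one example per
# line in the form:  __label__positive great movie
import fasttext

model = fasttext.train_supervised(input="train.txt",
                                  wordNgrams=2,   # N-gram features
                                  loss="hs",      # hierarchical softmax
                                  epoch=5, lr=0.5)

labels, probs = model.predict("the plot was surprisingly good")
print(labels, probs)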
4. CONCLUSION AND FUTURE WORK
Deep neural network models such as CNN were first successful in the field of image processing, and their strength is capturing local correlation. In text classification tasks, a CNN can automatically extract key information similar to n-grams in sentences. An RNN with a recurrent structure can capture contextual information as far as possible when learning word representations, and the representation of the text can then be constructed with a convolutional neural network. An LSTM is able to learn phrase-level features through a convolutional layer and to learn long-term dependencies over sequences of higher-level representations. Besides, with the improvement of language models such as BERT and ERNIE, they have achieved better results than many deep neural network models. Therefore, in future work we will pay more attention to solving text classification with language models such as BERT and ERNIE.

5. REFERENCES
[1] Kim Y. Convolutional Neural Networks for Sentence Classification[J]. Eprint Arxiv, 2014.
[2] Lai S, Xu L, Liu K, Zhao J. Recurrent Convolutional Neural Networks for Text Classification[C]. AAAI Conference on Artificial Intelligence, 2015.
[3] Zhou C, Sun C, Liu Z, et al. A C-LSTM Neural Network for Text Classification[J]. Computer Science, 2015, 1(4):39-44.
[4] Joulin A, Grave E, Bojanowski P, et al. Bag of Tricks for Efficient Text Classification[J]. 2016.
[5] Aggarwal C C, Zhai C. A Survey of Text Classification Algorithms[J]. 2012.
[6] Duchi J, Hazan E, Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization[J]. Journal of Machine Learning Research, 2011, 12:2121–2159.
[7] Dong L, Wei F, Liu S, Zhou M, Xu K. A Statistical Parsing Framework for Sentiment Classification[J]. CoRR, abs/1401.6330, 2014.
[8] Graves A, Mohamed A, Hinton G. Speech Recognition with Deep Recurrent Neural Networks[C]. Proceedings of ICASSP 2013.
[9] Bengio Y, Courville A, Vincent P. Representation Learning: A Review and New Perspectives[J]. IEEE TPAMI, 2013.
[10] Zhang Y, Wallace B. A Sensitivity Analysis of Convolutional Neural Networks for Sentence Classification[J]. Computer Science, 2015.