LSTM with sentence representations for Document-level Sentiment Classification
Neurocomputing (2018). DOI: 10.1016/j.neucom.2018.04.045
Weihang Huang, Guozheng Rao∗, Zhiyong Feng, Qiong Cong
School of Computer Science and Technology, Tianjin University, Tianjin, China
a Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
b School of Computer Software, Tianjin University, Tianjin, China
Abstract
Recently, due to their ability to deal with sequences of different lengths, neural networks have achieved great success on sentiment classification, and long short-term memory (LSTM) networks in particular are widely used for this task. However, one of the remaining challenges is to model long texts in document-level sentiment classification so as to exploit the semantic relations between sentences. Existing neural network models are not powerful enough to capture sufficient sentiment information from relatively long time-steps. To address this problem, we propose a new neural network model (SR-LSTM) with two hidden layers. The first layer learns sentence vectors to represent the semantics of sentences with a long short-term memory network, and in the second layer the relations of the sentence vectors are encoded to obtain the document representation.
∗ Corresponding author.
Email address: [email protected] (Guozheng Rao)
1. Introduction
Sentiment classification is a fundamental task in sentiment analysis [2]. Recently, neural network approaches have achieved great success on sentiment classification. Neural networks were applied to this task early on, freeing researchers from handcrafted feature engineering [3]. Among these methods, recurrent neural networks (RNNs) are one of the most prevalent architectures because of their ability to handle texts of variable length.
Paragraph-level or sentence-level sentiment analysis expects the model to extract features from a limited source of information [4], while document-level sentiment analysis demands more in selecting and storing global sentiment information from long texts full of noise and redundant local patterns. Simple recurrent neural networks are not powerful enough to handle this overflow of information or to capture long-range sentiment information, since they suffer from gradient diffusion and explosion. LSTM can capture the long dependencies in a sequence by introducing a memory unit and a gate mechanism that decide how to utilize and update the information kept in the memory cell. A cached LSTM [6] goes further in keeping sentiment information over long texts. These models can capture relations between all words in a document, but they are not capable of modeling the intrinsic relations between sentences.

Partially inspired by the structure of LSTM and semantic compositionality [8], we propose a deep recurrent neural network with two hidden layers. It first produces sentence representations from word representations and then composes them into a document representation, which takes into consideration semantics at different granularities. These document representations are used as features to classify the sentiment label of each document. We conduct document-level sentiment classification on three large-scale review datasets from IMDB, Yelp 2014 and Yelp 2015. We compare our model to three classes of models: machine learning methods [9], recurrent neural network models and bidirectional neural network models.
Our main contributions are as follows:

• We introduce a neural network model with two hidden layers to learn continuous document representations. Each document is divided into a certain number of sentences, so that each layer only has to handle shorter sequences.

• The proposed model can encode the relations between sentences in the document representation.

• We conduct document-level sentiment classification on three document-level datasets from IMDB and the Yelp Dataset Challenge.
2. Related work
In this section, we review related work in two areas. First, we discuss document-level sentiment classification and the existing approaches; second, we discuss recurrent neural networks for document-level sentiment classification.
Document-level sentiment classification aims to identify the sentiment polarity of a whole document [10]. Pang and Lee put forward the concept of document-level sentiment classification for the first time [11]. Its biggest challenge is not only to consider the semantics between words and sentences, but also to take the overall context into account when representing the document composition; we obviously cannot simply use the sum of all word representations to represent the whole document.
Various methods have been investigated and explored over the years, most of which depend on traditional machine learning algorithms. Pang et al. first used a supervised machine learning method, building an SVM classifier that represents documents with bag-of-words features [2]. Turney used sentiment phrases extracted from syntactic patterns for document-level sentiment classification [12]. Goldberg and Zhu used a graph-based method in a semi-supervised setting for this task [13]. Many studies show that SVM and naive Bayesian classifiers perform better than other machine learning methods. Afterwards, various further approaches were explored [14, 15, 16], most of which still rely on handcrafted features.
Most models for distributed representations fall into three classes in which real-valued vectors are used to represent meaning: bag-of-words models, sequence models and tree-structured models. In bag-of-words models, phrase and sentence representations are independent of word order and can be generated by averaging the constituent word representations [17]. Sequence models, in contrast, construct sentence representations as an order-sensitive function of the sequence of tokens [18]. More recently, tree-structured models that compose representations along a given syntactic structure have also been explored [19].
Long short-term memory networks have emerged as a popular model due to their effectiveness at capturing long-term dependencies. LSTM networks have been successfully applied to a variety of fields, such as machine translation [20], speech recognition [21], understanding subtitles [22] and image caption generation [23].
Recently, the use of neural network based methods for sentiment classification has gradually become popular. They are prevalent due to their ability to learn discriminative features from data [23], and they can also take the overall context information into account. With the development of distributed representations, neural networks have advanced sentiment classification substantially. It is known that good word embeddings as inputs can improve neural network models. In a simple recurrent network, however, the gradient vector can grow or vanish exponentially over long sequences [28]. Long short-term memory networks solve this biggest problem of RNNs, the vanishing gradient problem. Variant models based on LSTM have been proposed to further increase the ability of LSTMs: for example, [19] put a tree-structured model into LSTM for better semantic composition, and [29] add an extra sparse matrix into LSTMs to get better results. Most of these models work well on sentence-level and paragraph-level sentiment classification, but struggle when it comes to document-level sentiment classification. In order for LSTM to store longer information, various LSTM-based models have been proposed to increase the ability of LSTMs to store long-range information. For example, [30] put an external memory into LSTM, but it performs poorly in terms of running time because of the huge external memory matrix. [31] use a bidirectional LSTM with an attention model for document-level sentiment classification, and [6] came up with CLSTM, which defines a concept of forgetting rates and divides the memory into several groups; different forgetting rates, regarded as filters, are assigned to different groups.
These methods are all built on a single LSTM layer whose internal structure is modified. In comparison, we propose an LSTM model with two hidden layers. The first layer is trained to acquire sentence vectors. A simple way to get sentence vectors is to ignore the order of sentences and average the word embeddings, but this fails to capture complex semantic relations between words; one could also use a CNN to get sentence vectors [7]. We use the first LSTM layer to achieve this, and the second layer uses the sentence vectors to derive the overall sentiment polarity of the document as the document composition. This not only overcomes the limit that a standard LSTM cannot store very long information, but also, through the two-layer LSTM structure, lets us take into consideration semantics at different granularities.

3. Methods

In this section, we will review LSTM and some of its variants, and then we will introduce our method and an approach to improve it.
3.1. Long short-term memory network

LSTM introduces a memory unit and a gate mechanism, which decide how to utilize and update the information kept in the memory cell. Because of this structure, it alleviates the problem of gradient diffusion and explosion. Figure 1 shows the structure of a standard LSTM at time step t: i, o and f denote the input gate, the output gate and the forget gate, and c is the memory cell. The inputs of the input gate i_t, the output gate o_t and the forget gate f_t are the vector x_t that the network receives at time t and the previous hidden state h_{t−1}.
Figure 1: The structure of a standard LSTM unit at time step t, with input gate i, output gate o, forget gate f and memory cell c.
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)        (1)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)        (2)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)        (3)
ĉ_t = tanh(W_c x_t + U_c h_{t−1} + b_c)        (4)
c_t = f_t ◦ c_{t−1} + i_t ◦ ĉ_t        (5)
h_t = o_t ◦ tanh(c_t)        (6)
where σ is the logistic sigmoid function, and f_t, i_t, o_t and c_t are the forget gate, the input gate, the output gate and the memory cell activation vector at time step t. The entries of the gating vectors i_t, f_t, o_t lie in [0, 1], and ◦ denotes element-wise multiplication. The bias terms b_f, b_i, b_o, b_c ∈ R^H have the same size as h_t ∈ R^H; W_f, W_i, W_o, W_c ∈ R^{H×d} and U_f, U_i, U_o, U_c ∈ R^{H×H}, where H and d denote the dimensionality of the hidden layer and of the input, respectively.
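To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step following Eqs. (1)-(6). The function name, the params container and its key layout are our own illustrative conventions, not part of the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    # x_t: input vector of size d; h_prev, c_prev: previous hidden and cell states of size H.
    # params holds W_* (H x d), U_* (H x H) and b_* (H) for the f, i, o and c transforms.
    W, U, b = params["W"], params["U"], params["b"]
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate, Eq. (2)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate, Eq. (3)
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell, Eq. (4)
    c_t = f_t * c_prev + i_t * c_hat                          # memory cell update, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                  # hidden state, Eq. (6)
    return h_t, c_t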
3.2. Some variants of LSTM

Since LSTM was put forward, it has attracted wide attention, and in recent years it has been applied more and more widely in natural language processing. Several popular variants are described below; a short code sketch of the coupled-gate and GRU updates follows the list.

• PC-LSTM

PC-LSTM [32] adds peephole connections so that every gate can also accept the input information of the cell state. Compared with the standard LSTM, in the PC-LSTM the forget gate, the input gate and the output gate can all see the whole information in the memory cell.

• CIFG-LSTM
CIFG-LSTM couples the forget gate and the input gate to avoid producing redundant information [33]: a new value is added to the state exactly when some old information is forgotten. We therefore define i_t = 1 − f_t and use f_t to denote the coupled gate, replacing Eq. (5) with:

c_t = f_t ◦ c_{t−1} + (1 − f_t) ◦ ĉ_t        (7)
• GRU

Another popular variant is the GRU, proposed by Cho et al. in 2014 [34]. It combines the forget gate and the input gate into a single update gate, and the unit state and the hidden state are also merged, so the GRU model is simpler than the standard LSTM model. The GRU has only two gates, the reset gate r and the update gate z, which jointly control how the new hidden state s_t is computed from the previous hidden state s_{t−1}; the output gate of the LSTM is removed. If the reset gate is 1 and the update gate is 0, the GRU degrades completely to a plain RNN.
• Bi-LSTM

In a bidirectional LSTM, the forward and the backward training sequences are handled by two LSTMs, both connected to the same output layer. This structure provides complete past and future context information for each point in the input sequence.
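As a rough illustration of two of the variants above, the following sketch reuses the parameter layout and sigmoid helper of the earlier lstm_step example (our own conventions, not the paper's) to show the coupled-gate update of Eq. (7) and a GRU step written so that a reset gate of 1 and an update gate of 0 recover a plain RNN, as stated above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cifg_lstm_step(x_t, h_prev, c_prev, params):
    # CIFG-LSTM: the input gate is coupled to the forget gate, i_t = 1 - f_t (Eq. (7)).
    W, U, b = params["W"], params["U"], params["b"]
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    c_t = f_t * c_prev + (1.0 - f_t) * c_hat                  # Eq. (7)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def gru_step(x_t, s_prev, params):
    # GRU: reset gate r and update gate z; the cell state and hidden state are merged.
    W, U, b = params["W"], params["U"], params["b"]
    z = sigmoid(W["z"] @ x_t + U["z"] @ s_prev + b["z"])      # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ s_prev + b["r"])      # reset gate
    s_hat = np.tanh(W["s"] @ x_t + U["s"] @ (r * s_prev) + b["s"])
    return z * s_prev + (1.0 - z) * s_hat                     # z = 0, r = 1 gives a plain RNN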
3.3. LSTM with sentence representations (SR-LSTM)

Figure 2: The LSTM with sentence representations model for document-level sentiment classification. Word embeddings feed a first LSTM layer that produces sentence vectors; a second LSTM layer composes them into a document representation, followed by a softmax layer.
Our model is built on LSTM layers, which can handle sequences of variable length, so we can capture a more delicate relationship between the word embeddings and the sentence representations in a document. Figure 2 shows the structure of LSTM with sentence representations. The idea comes from compositionality: a document consists of a list of sentences and each sentence consists of a list of words; the meaning of a sentence comes from the meanings of its words and the rules used to combine them, and the representation of a document comes from the meanings of its sentences. We use LSTMs with two layers. The input of the first layer is all the word embeddings [9] in the whole document. The word vectors can be pre-trained from a text corpus with embedding learning algorithms such as Word2vec [24] or Glove [25]; in our model, we adopt Glove to make better use of the semantic and grammatical associations of words. Our model first produces continuous sentence vectors from the word representations. Because the current time-step of an RNN or LSTM combines the output of the previous time-step with the input of the current time-step, we can use the output of the last time-step in each sentence to represent that sentence; this is the sentence representation we put forward to represent sentence composition. All the sentence representations are then the input of the second LSTM layer.
After calculating the hidden vectors of the second layer, we regard the last hidden vector as the document representation [35]. Document representations are then used as features for document-level sentiment classification. We feed the document representation to a linear layer whose output length is the number of classes, and add a softmax layer to output the probability of classifying the document as positive, negative or neutral. The softmax function is calculated as follows, where C is the number of sentiment categories:

softmax_i = exp(x_i) / Σ_{i′=1}^{C} exp(x_{i′})        (8)

Our neural network model can handle shorter sequences and reduce the forgetting rate of information; at the same time, it not only considers the semantic information between words, but also combines the semantic information between sentences, encoding the relations between sentences into the semantic meaning of the document.
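To make the two-layer composition concrete, here is a minimal PyTorch sketch of one possible reading of SR-LSTM: a word-level LSTM whose last hidden state per sentence serves as the sentence vector, a sentence-level LSTM whose last hidden state serves as the document representation, and a linear layer whose outputs are normalized by the softmax of Eq. (8) inside the loss. The class name, tensor layout and the handling of padding are our own simplifications, not details fixed by the paper.

import torch
import torch.nn as nn

class SRLSTM(nn.Module):
    # A two-layer LSTM over words and then over sentences (a sketch of SR-LSTM).
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=120, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # e.g. initialized with Glove
        self.word_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, docs):
        # docs: (batch, max_sentences, max_words) tensor of word ids, zero-padded.
        b, n_sent, n_word = docs.size()
        words = self.embedding(docs.view(b * n_sent, n_word))     # word embeddings
        _, (h_word, _) = self.word_lstm(words)
        sent_vecs = h_word[-1].view(b, n_sent, -1)                # last time-step = sentence vector
        _, (h_sent, _) = self.sent_lstm(sent_vecs)
        doc_repr = h_sent[-1]                                     # document representation
        return self.classifier(doc_repr)                          # logits; softmax of Eq. (8) in the loss

During training, nn.CrossEntropyLoss applies the softmax of Eq. (8) to these logits.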
3.4. Sentiment dictionary

A general sentiment dictionary contains emotional words, degree words and negative words. Emotional words can be divided into positive evaluation words, positive emotion words, negative evaluation words and negative emotion words, while degree words are divided into several levels; for example, we define "most" as the highest level and "least" as the lowest level. The effect of negative words is to determine whether to reverse the polarity of words.

Figure 3 shows the data structure of a general sentiment dictionary. We assume that the degree words are divided into three categories.
Figure 3: The data structure of a general sentiment dictionary: emotional words (positive and negative), degree words (most, medium and least) and negative words.
One way to judge the polarity of a sentence is to compare word counts: if a sentence contains more positive words than negative words, the sentence can be considered as positive, and vice versa. Another way is to use the sum of the scores of all words or phrases in a sentence to represent its emotional polarity; in our model, we use the second approach. Unlike machine learning and neural network methods, this approach does not take into account the semantic relations between words or sentences, so it is not appropriate for document-level sentiment classification on its own and cannot obtain good results there, but its performance on paragraph-level or sentence-level sentiment analysis is still acceptable. In our model LSTM with sorted sentence representations, we use this method to obtain the polarity of the sentences in each document.
Figure 4: The structure of SSR-LSTM. A sentiment-dictionary-based layer selects Maxn sentences from each document (keeping the strongest-polarity sentences when n > Maxn and padding when n <= Maxn) before the data enters the SR-LSTM model.
3.5. LSTM with sorted sentence representations (SSR-LSTM)

Theoretically, LSTM can handle sequences of any length and avoid gradient vanishing. To get a better training result, we specify a maximum length of sequences [37] and intercept the excess; in our model, we define Maxn as the maximum number of sentences in each document. When a document exceeds the maximum number of sentences, we cut off the extra sentences to fit the input; if the document has fewer than the maximum number of sentences, we fill the remaining places with zeros.
When we deal with a document whose number of sentences exceeds Maxn, we have to decide which sentences to keep. We score each sentence with the sentiment dictionary: the NLP pipeline divides the sentence into phrases, the sentiment score of a phrase is obtained by averaging the sentiment values of all synonyms of the same word, and if there is an adverb the score of the phrase is multiplied by a weight; finally, the scores of all the parts of a sentence are added up as the sentence sentiment score. Positive values represent positive polarity and negative values represent negative polarity, and we compare absolute values to sort sentences by the strength of their sentiment polarity. Figure 4 shows the structure of SSR-LSTM: before data enters the SR-LSTM neural network model, an added hidden layer processes it with the sentiment dictionary. If the number of sentences contained in a document is more than Maxn, we select the Maxn sentences with the strongest polarity, and if the number is less than Maxn, we pad with zero vectors up to Maxn; then we input the processed data into the SR-LSTM model. A sketch of this selection step is given below.
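A minimal sketch of this sentence-selection step, assuming a dictionary-based scoring function sentence_score (the adverb-weighted sum of phrase scores described above) is available; the function and variable names are illustrative only.

from typing import Callable, List

def select_sentences(doc_sentences: List[str], maxn: int,
                     sentence_score: Callable[[str], float]) -> List[str]:
    # Keep the Maxn sentences with the strongest sentiment polarity (largest |score|),
    # preserving the original sentence order; pad shorter documents.
    if len(doc_sentences) > maxn:
        ranked = sorted(range(len(doc_sentences)),
                        key=lambda i: abs(sentence_score(doc_sentences[i])),
                        reverse=True)
        keep = sorted(ranked[:maxn])
        return [doc_sentences[i] for i in keep]
    # The empty-string pads correspond to the zero vectors fed to SR-LSTM.
    return doc_sentences + [""] * (maxn - len(doc_sentences))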
The entire model is trained end-to-end with stochastic gradient descent, where the loss function is the cross-entropy error of supervised sentiment classification. Overfitting means that the model fits the training data, including its noise, so closely that the training cost is minimized while the general pattern is ignored, and as a result the model performs poorly on unseen data such as the test set. To reduce this problem, we tune hyperparameters on a separate validation set rather than on the test data (see Section 4). A minimal training loop following this setup is sketched below.
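The following PyTorch sketch shows a training loop consistent with this setup (cross-entropy loss, trained end-to-end with a stochastic gradient method; Adagrad with learning rate 0.01 is the optimizer reported in Section 4). The data loader, epoch count and device are assumptions made for illustration.

import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=0.01, device="cpu"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()                            # supervised cross-entropy error
    optimizer = torch.optim.Adagrad(model.parameters(), lr=lr)   # Adagrad, initial lr 0.01
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for docs, labels in train_loader:                        # batches of (padded docs, labels)
            docs, labels = docs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(docs), labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean training loss {total_loss / len(train_loader):.4f}")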
4. Experiment
In this section, we study the results of our models on three real-world datasets for document-level sentiment classification, and we compare the effect of different settings on our models.

4.1. Datasets and experimental settings
Dataset    Train Size  Valid Size  Test Size  Ws/Doc  Sens/Doc  Class
Yelp 2014  183019      22745       25399      196.9   11.41     5
Yelp 2015  194360      23652       25341      151.9   8.97      5
IMDB       67426       8381        9112       394.6   16.08     10

Table 1: Statistical information of the IMDB and Yelp 2014/2015 datasets. Train size, Valid size and Test size are the numbers of training, validation and test documents; Ws/Doc and Sens/Doc are the average numbers of words and sentences per document; Class is the number of classes.
Datasets   Hidden layer units  Maxn  Batch size
Yelp 2014  120                 11    64
Yelp 2015  120                 9     64
IMDB       160                 16    128

Table 2: Hyperparameter settings (hidden layer units, Maxn and batch size) for the three datasets.
IMDB is a movie review dataset, and Yelp 2014 and Yelp 2015 are two restaurant review datasets. All three datasets are publicly accessible. We use the Stanford CoreNLP [39] for tokenization and sentence splitting on these datasets. Table 1 shows the statistical information of the three datasets. We split the three datasets into training, validation and test sets with a ratio of 80/10/10. The training set is mainly used to train the model; to avoid overfitting we cannot adjust parameters directly based on the results on the test set, so we use the validation set to determine the hyperparameters of the model and to evaluate its behaviour under different parameters. We use accuracy and MSE to evaluate our models: accuracy is a standard metric measuring the overall sentiment classification performance, while MSE (mean squared error) is a convenient way to measure the average error. The smaller the MSE value, the better the predictive model describes the experimental data.
The MSE is calculated as follows:
MSE = Σ_{j=1}^{N} (standard_j − predicted_j)² / N        (10)
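Both metrics can be computed directly from the predicted and gold label arrays; the following small NumPy helper (our own, with numeric class labels assumed) implements accuracy and the MSE of Eq. (10).

import numpy as np

def evaluate(predicted, gold):
    predicted = np.asarray(predicted, dtype=float)
    gold = np.asarray(gold, dtype=float)
    accuracy = float(np.mean(predicted == gold))        # fraction of exactly correct labels
    mse = float(np.mean((gold - predicted) ** 2))       # Eq. (10)
    return accuracy, mse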
We tune the hyperparameters to make our models perform as well as possible. We use the publicly available 300-dimensional Glove vectors [25] as pre-trained word embeddings, and the dimension of the hidden units is set to 120. We use Adagrad [40] as the optimizer with an initial learning rate of 0.01. For IMDB, Yelp 2014 and Yelp 2015 we set the batch size to 128, 64 and 64, respectively. The number of hidden layer units is 120 for the three datasets. Maxn in our models represents the maximum number of sentences; this parameter is selected based on the average number of sentences per document in each dataset. For example, we set Maxn to 16 for IMDB because the average number of sentences per document in IMDB is 16.08. Finally, Maxn is chosen among (16, 11, 10), which gives the best parameters for the three datasets. The settings are shown in Table 2.
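As an illustration of how the pre-trained 300-dimensional Glove vectors can be plugged into the embedding layer, the sketch below builds an embedding matrix from a GloVe text file (one word and its vector per line). The file path and vocabulary mapping are assumed to exist, and the [-0.1, 0.1] uniform initialization for out-of-vocabulary words follows the random-initialization range mentioned in the word-embedding comparison later in this section.

import numpy as np
import torch

def load_glove_embeddings(glove_path, word_to_idx, dim=300):
    # Start from a small random matrix and overwrite rows found in the GloVe file.
    matrix = np.random.uniform(-0.1, 0.1, (len(word_to_idx), dim)).astype("float32")
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in word_to_idx and len(vec) == dim:
                matrix[word_to_idx[word]] = np.asarray(vec, dtype="float32")
    return torch.from_numpy(matrix)   # copy into the model, e.g. embedding.weight.data.copy_(...)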
4.2. Baseline methods

We compare our methods with the following baseline methods for document-level sentiment classification, divided into three categories. In the first class, we exploit machine learning algorithms to build sentiment classifiers.
• Naive Bayesian

We use bag-of-words features to train a naive Bayesian classifier.

• SVM

We also use bag-of-words features and train an SVM classifier with LibLinear [41].
In the second class, we use recurrent neural networks to model long sequences for document-level sentiment classification.

• RNN

A standard recurrent neural network without gating mechanisms.

• LSTM

LSTM is a recurrent neural network with memory cells and a three-gate mechanism (Hochreiter and Schmidhuber, 1997).
• PC-LSTM

In comparison with the standard LSTM, PC-LSTM [32] adds peephole connections to the memory cell so that every gate can also accept the input information of the cell state.

• CIFG-LSTM

CIFG-LSTM [33] combines the input and forget gates of the LSTM, replacing Eq. (5) with Eq. (7).

• GRU

GRU combines the forget gate and the input gate into a single update gate [34]. The unit state and the hidden state are also merged, so the GRU model is simpler than the standard LSTM.

• 2-layer LSTM

A stacked LSTM with two hidden layers applied directly to the word sequence of the whole document [19].

• CLSTM

CLSTM [6] introduces a cache mechanism that divides the memory into several groups with different forgetting rates in order to capture long-range sentiment information.
Models        IMDB Acc   IMDB MSE   Yelp 2014 Acc   Yelp 2014 MSE   Yelp 2015 Acc   Yelp 2015 MSE
RNN           0.232      5.82       0.473           1.15            0.479           1.09
LSTM          0.398      3.21       0.610           0.57            0.617           0.55
PC-LSTM       0.402      3.23       0.612           0.55            0.615           0.56
CIFG-LSTM     0.395      3.17       0.607           0.59            0.610           0.57
GRU           0.405      3.15       0.609           0.55            0.611           0.56
2-layer LSTM  0.401      3.18       0.613           0.51            0.625           0.45
CLSTM         0.429      2.67       0.624           0.49            0.627           0.47
SR-LSTM       0.440      2.24       0.632           0.46            0.639           0.46
SSR-LSTM      0.443      2.25       0.639           0.45            0.638           0.44
Bi-LSTM       0.432      2.21       0.625           0.49            0.625           0.48
SSR-BiLSTM    0.463      2.13       0.651           0.41            0.653           0.40

Table 3: Results of our models against baseline models on IMDB, Yelp 2014 and Yelp 2015. Acc and MSE are the evaluation metrics: Acc means accuracy (higher is better) and MSE means mean squared error (lower is better). Best results in each group are in bold.
4.3. Results
We compare our models with the baselines and other competitive models. The results are shown in Table 3, from which we draw several findings.

1. First, we compare the two machine learning methods, NB and SVM (naive Bayesian and support vector machine), and find that SVM performs better. Designing effective features is fundamental work, and the performance of a machine learning classifier depends heavily on the choice of data representations and features, whereas neural network models can automatically learn features from the characteristics of the data; this is why they have been widely used for sentiment classification recently. From our experiments we can conclude that LSTM and the machine learning classifiers have almost the same performance.
2. Among the recurrent neural networks selected as baselines, RNN has the worst performance in modeling long texts due to the vanishing gradient problem. In comparison, LSTM, PC-LSTM, CIFG-LSTM and GRU perform better, which shows that an internal memory and the three-gate structure play a key role in modeling long texts. The four LSTM variants have similar performance.

3. Our models, LSTM with sentence representations (SR-LSTM) and LSTM with sorted sentence representations (SSR-LSTM), have the best performance on the three datasets and beat CLSTM proposed by [6] and the 2-layer LSTM [19]. In particular, on Yelp 2014 SSR-LSTM achieves an accuracy of 0.639, which is 0.015 higher than CLSTM and 0.026 higher than the 2-layer LSTM. On IMDB and Yelp 2014, SSR-LSTM performs better than SR-LSTM; on Yelp 2015, however, SSR-LSTM has almost the same performance as SR-LSTM, which suggests that subjective sentences rarely appear at the end of documents in the Yelp 2015 dataset.

4. With the help of the bidirectional architecture, models can look forward and backward to capture features when modeling long texts, so Bi-LSTM performs better than the single-directional models. Among the bidirectional models, our SSR-BiLSTM performs well, achieving accuracies of 0.463, 0.651 and 0.653 on IMDB, Yelp 2014 and Yelp 2015, with MSE values of 2.13, 0.41 and 0.40.
In addition, our models only use the output sentence vectors of the first layer as the input of the second layer, so they have fewer parameters and require less computational time.
4.4. Effect of word embeddings

We know that the input of neural networks is word embeddings [9], and it is well accepted that good word embeddings are crucial to composing text representations at a higher level. We therefore want to know the effect of different word embeddings on our models. We choose IMDB as the document-level dataset for this comparison and compare randomly initialized vectors, two word2vec models (CBOW and Skip-gram) [42] and Glove [25] on three models: LSTM, SR-LSTM and SSR-LSTM. All the word vectors are 300-dimensional and learned from Twitter.
Table 4: Classification accuracy of LSTM, SR-LSTM and SSR-LSTM with Glove vectors of different dimensions (50, 100, 200 and 300).
Randomly initialized vectors mean that we treat the word embeddings like the other parameters of the model: we randomly initialize them by sampling from a uniform distribution in [-0.1, 0.1] and update all parameters with stochastic gradient descent. Distributed representations, also known as word embeddings, were first proposed by Hinton in 1986. Word2vec learns word embeddings from local context, while Glove additionally takes advantage of global context [25].
From Table 6, we can see that the two word2vec models (CBOW and Skip-gram) and Glove perform better than randomly initialized vectors, especially for SSR-LSTM. This shows the importance of context information for word embedding learning methods such as Word2vec and Glove. In addition, Glove gives a slight further increase in accuracy on all three models, which indicates the importance of global context for estimating good word representations.
We also compare Glove vectors with different dimensions (50/100/200/300). Classification accuracy and time cost are given in Table 4 and Table 5, respectively. We find that 200-dimensional word vectors perform better than 50-dimensional and 100-dimensional word vectors, while 300-dimensional word embeddings do not show a significant further improvement. Furthermore, SR-LSTM and SSR-LSTM have similar time costs; they take more time than LSTM because they have more parameters, but they reach a higher classification accuracy.

Table 5: Time cost of each model with 50-, 100-, 200- and 300-dimensional Glove vectors. Each value is the number of minutes needed to train the model.

Table 6: Classification accuracy of LSTM, SR-LSTM and SSR-LSTM with different word embeddings. We compare randomly initialized vectors, CBOW, Skip-gram and Glove vectors.
Figure 5: The classification accuracy (%) and training time (minutes) of SSR-LSTM on the IMDB dataset as Maxn varies from 8 to 24.
4.5. Effectiveness on Maxn
The classification accuracy and training time of SSR-LSTM for different values of Maxn on the IMDB dataset are shown in Figure 5. From the tendency of the two curves, we find that as the selected Maxn grows, the classification accuracy improves gradually; the increase is especially obvious as Maxn goes from 10 to 16, while towards Maxn = 24 the accuracy grows slowly and even declines between 22 and 24. The reason is that when Maxn is too large, fewer and fewer documents contain more than Maxn sentences, so the impact on accuracy becomes smaller and smaller. From Figure 5 we can also see that the training time grows longer and longer as Maxn increases, and even grows faster, because the neural network model needs to train more parameters as Maxn increases.

So in our model SSR-LSTM, it is reasonable to set Maxn to the average number of sentences per document on the three datasets, because this does not consume too much time and still ensures a high classification accuracy.
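For concreteness, the Maxn values listed in Table 2 can be recovered by rounding the average number of sentences per document from Table 1; the snippet below is just that calculation, with the rounding rule being our own reading of the text.

sens_per_doc = {"IMDB": 16.08, "Yelp 2014": 11.41, "Yelp 2015": 8.97}   # Sens/Doc from Table 1
maxn = {name: round(avg) for name, avg in sens_per_doc.items()}
print(maxn)   # {'IMDB': 16, 'Yelp 2014': 11, 'Yelp 2015': 9}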
5. Conclusion
We introduce new neural network models (SR-LSTM and SSR-LSTM) for document-level sentiment classification. SR-LSTM exploits a two-layer LSTM architecture that encodes the semantics of sentences and their relations in the document representation. Since a document consists of many sentences and each sentence is a list of words, our model models the document in two steps: the first layer uses word embeddings to produce sentence vectors, and the sentence vectors are treated as inputs of the second layer to get document representations. SSR-LSTM is an approach to improve SR-LSTM which first removes the sentences with weaker emotional polarity from the datasets. Before the data is input to the SR-LSTM model, we clean the three datasets: without changing the order of sentences, we choose a fixed number of sentences in each document according to their sentiment polarity. For SSR-LSTM, such sorted input builds a better model. Our two models are trained end-to-end with supervised sentiment classification objectives. Empirical results on three datasets (IMDB, Yelp 2014, Yelp 2015) show that our models outperform state-of-the-art models.

For future work, we want to find a better way to get sentence vectors. In our model, we simply model the document in a sequential way; one could instead compose the document representation over discourse tree structures rather than sequentially.
6. Acknowledgments
We thank Guozheng Rao for his constructive work and the fruitful discussions; we would also like to thank the anonymous reviewers for their valuable comments. This work is supported by the National Natural Science Foundation of China (NSFC) under grants 61373165, 61373035 and 61672377.
References
[2] B. Pang, L. Lee, S. Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques, in: Proceedings of EMNLP, 2002, pp. 79–86.

[3] P. Liu, X. Qiu, X. Huang, Recurrent neural network for text classification with multi-task learning, in: International Joint Conference on Artificial Intelligence, 2016, pp. 2873–2879.

[4] B. O'Connor, R. Balasubramanyan, B. R. Routledge, N. A. Smith, From tweets to polls: Linking text sentiment to public opinion time series, in: International Conference on Weblogs and Social Media (ICWSM), Washington, DC, USA, 2010.

[6] J. Xu, D. Chen, X. Qiu, X. Huang, Cached long short-term memory neural networks for document-level sentiment classification, in: Proceedings of EMNLP, 2016, pp. 1660–1669.
[7] D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural network for sentiment classification, in: Proceedings of EMNLP, 2015, pp. 1422–1432.

[8] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proceedings of EMNLP, 2013.
[10] B. Pang, L. Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, in: Meeting of the Association for Computational Linguistics, 2005, pp. 115–124.

[11] B. Pang, L. Lee, Opinion Mining and Sentiment Analysis, Now Publishers Inc., 2008.

[12] P. D. Turney, Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2002, pp. 417–424.

[13] A. B. Goldberg, X. Zhu, Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization, in: The Workshop on Graph Based Methods for Natural Language Processing, 2006, pp. 45–52.

[14] S. Wang, C. D. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, in: Meeting of the Association for Computational Linguistics, 2012.

[15] R. Xia, C. Zong, Exploring the use of word relation features for sentiment classification.

[16] G. Ifrim, G. Weikum, The bag-of-opinions method for review rating prediction from sparse text patterns, in: COLING 2010, International Conference on Computational Linguistics, 2010.
[22] H. Zhang, J. Li, Y. Ji, H. Yue, Understanding subtitles by character-level sequence-to-sequence learning, IEEE Transactions on Industrial Informatics 13 (2) (2017) 616–624.

[26] Q. Qian, B. Tian, M. Huang, Y. Liu, X. Zhu, X. Zhu, Learning tag embeddings and tag-specific composition functions in recursive neural network, in: Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 2015.
[31] D. Tang, B. Qin, X. Feng, T. Liu, Effective LSTMs for target-dependent sentiment classification, arXiv preprint.

[32] F. A. Gers, J. Schmidhuber, Recurrent nets that time and count, in: IEEE-INNS-ENNS International Joint Conference on Neural Networks, 2000, pp. 189–194.

[33] J. Chung, C. Gulcehre, K. H. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint.

[34] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint.
[35] D. Tang, B. Qin, T. Liu, Learning semantic representations of users and products for document level sentiment classification, in: Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 2015, pp. 1014–1023.
[40] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (7) (2011) 257–269.

[41] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, C. J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (9) (2008) 1871–1874.

[42] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint.
Biography
Weihang Huang is with the School of Computer Science and Technology, Tianjin University, Tianjin, China. Corresponding author. Mail: [email protected]

Guozheng Rao is with the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China. Mail: [email protected]

Zhiyong Feng is with the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, and the School of Computer Software, Tianjin University, Tianjin, China.

Qiong Cong is with the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China.