
Communicated by Prof. H. Zhang

Accepted Manuscript

LSTM with sentence representations for Document-level Sentiment Classification

Weihang Huang, Guozheng Rao, Zhiyong Feng, Qiong Cong

PII: S0925-2312(18)30479-X
DOI: 10.1016/j.neucom.2018.04.045
Reference: NEUCOM 19512

To appear in: Neurocomputing

Received date: 30 November 2017


Revised date: 6 March 2018
Accepted date: 20 April 2018

Please cite this article as: Weihang Huang, Guozheng Rao, Zhiyong Feng, Qiong Cong, LSTM with
sentence representations for Document-level Sentiment Classification, Neurocomputing (2018), doi:
10.1016/j.neucom.2018.04.045

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

LSTM with sentence representations for Document-level Sentiment Classification

Weihang Huang
School of Computer Science and Technology, Tianjin University, Tianjin, China

Guozheng Rao (a,*), Zhiyong Feng (a,b), Qiong Cong (a)

a Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
b School of Computer Software, Tianjin University, Tianjin, China

Abstract

Recently, owing to their ability to handle sequences of different lengths, neural
networks, and long short-term memory networks in particular, have achieved
great success in sentiment classification. However, one remaining challenge is to
model long texts so as to exploit the semantic relations between sentences in
document-level sentiment classification. Existing neural network models are not
powerful enough to capture sufficient sentiment information across relatively long
time-steps. To address this problem, we propose a new neural network model
(SR-LSTM) with two hidden layers. The first layer learns sentence vectors that
represent the semantics of sentences with a long short-term memory network,
and in the second layer the relations between sentences are encoded in the
document representation. Further, we propose an approach to improve the model
that first cleans the datasets and removes sentences with little emotional polarity,
yielding a better input for our model. The proposed models outperform
state-of-the-art models on three publicly available document-level review datasets.

Keywords: Sentiment Classification, LSTM, Neural networks, Sentence vectors

∗ Corresponding author
Email address: [email protected] (Guozheng Rao)

Preprint submitted to Elsevier May 3, 2018



1. Introduction

Sentiment classification is one of the most widely used natural language
processing techniques in many areas, such as e-commerce websites, stock
forecasting, and political orientation analysis [1]. Document-level sentiment
classification is a fundamental task in sentiment analysis [2]. Recently, neural
network approaches have achieved great success in sentiment classification.
Neural networks were adopted for sentiment classification early on because they
free researchers from handcrafted feature engineering [3]. Among these methods,
Recurrent Neural Networks (RNNs) are one of the most prevalent architectures
because of their ability to handle variable-length texts.

Paragraph-level or sentence-level sentiment analysis expects the model to extract
features from a limited source of information [4], whereas document-level
sentiment analysis demands more in selecting and storing global sentiment
information from long texts containing noise and redundant local patterns.
Simple recurrent neural networks are not powerful enough to handle this
overflow and to capture key sentiment information from relatively distant
time-steps.

Efforts have been made to model long texts, and LSTM has recently become
popular for sentiment classification. The long short-term memory network
(LSTM), proposed by Hochreiter and Schmidhuber in 1997 [5], is a typical
recurrent neural network that alleviates the problems of gradient vanishing and
explosion. LSTM can capture long dependencies in a sequence by introducing a
memory unit and a gate mechanism that decides how to utilize and update the
information kept in the memory cell. A cached long short-term memory model
has been proposed for document-level sentiment analysis [6]. Tang et al. used a
gated recurrent neural network to model documents [7]. These models all exploit
neural networks to learn semantic relations between all words in a document, but
they are not capable of modeling the intrinsic relations between sentences.

Partially inspired by the structure of LSTM and by semantic
compositionality [8], we propose a deep recurrent neural network with two
hidden layers. It first produces continuous sentence vectors from word
representations; these sentence vectors are what we put forward to represent
sentence composition. All the sentence vectors of the whole document are then
the input of the second LSTM layer, whose last output vector represents the
document, taking into consideration semantics at different granularities. These
document representations are used as features to classify the sentiment label of
each document. We conduct document-level sentiment classification on three
large-scale review datasets drawn from IMDB, Yelp 2014, and Yelp 2015. We
compare our model to three classes of models: machine learning methods [9],
recurrent neural network models, and bi-directional neural network models.

Our main contributions are:

• We introduce a neural network model with two hidden layers to learn
continuous document representations. The documents are divided into a certain
number of sentences so that we can handle shorter sequences at each step and
retain as much key sentiment information as possible.

• The proposed model can encode the relations between sentences in the
document representation. It is a neural network model that can capture semantic
information both between words and between sentences.

• Our model outperforms state-of-the-art methods for document-level sentiment
classification on three document-level datasets from IMDB and the Yelp Dataset
Challenge.

2. Related work
In this section, we introduce related work: we first discuss the meaning of
document-level sentiment classification and existing approaches, then methods
for modeling distributed representations, and finally recurrent neural networks
for document-level sentiment classification.


2.1. Document-level Sentiment Classification

Sentiment classification is a relatively new research topic in NLP with great
research and application value. Document-level sentiment classification is a
difficult task within sentiment classification that aims to identify the sentiment
label of a document [10]. Pang and Lee put forward the concept of
document-level sentiment for the first time [11]. The biggest challenge of
document-level sentiment classification is not only to consider the semantics
between words and sentences, but also to consider the overall context so as to
represent the document composition. We cannot simply use the sum of all word
representations to represent the whole document; this is clearly not justified.
Various methods have been investigated and explored over the years, and most
of them depend on traditional machine learning algorithms. Pang first used a
supervised machine learning method, building an SVM classifier that represents
documents with bag-of-words features [2]. Turney used sentiment phrases
extracted from syntactic patterns for document-level sentiment
classification [12]. Goldberg used a graph-based method in a semi-supervised
setting for this task [13]. Many research results show that SVM and Naive Bayes
classifiers perform better than other machine learning methods. Subsequently,
various approaches focused on designing handcrafted features to make machine
learning methods perform well, including word n-grams [14], text topics [15],
and bags of opinions [16], but they rely heavily on effective handcrafted features.

2.2. Methods for modeling distributed representations


Most models for distributed representations fall into three classes in which
real-valued vectors are used to represent meaning: bag-of-words models,
sequence models, and tree-structured models. In bag-of-words models, phrase
and sentence representations are independent of word order; they can be
generated by averaging the constituent word representations [17]. Sequence
models, in contrast, construct sentence representations as an order-sensitive
function of the sequence of tokens [18]. More recently, tree-structured models
compose the representation of each phrase and sentence from its constituent
subphrases according to a given syntactic structure over the sentence [19].

Recurrent neural networks are a natural choice for sequence modeling tasks due
to their capability of processing arbitrary-length sequences. Long short-term
memory networks in particular have emerged as a popular model due to their
effectiveness at capturing long-term dependencies. LSTM networks have been
successfully applied to a variety of fields, including machine translation [20],
speech recognition [21], subtitle understanding [22], and image caption
generation [23].

2.3. Neural networks for Document-level Sentiment Classification

Recently, the use of neural network based methods for sentiment classification
has gradually become popular. They are prevalent due to their ability to learn
discriminative features from data [23], and they can also take the overall
context information into account. With the development of distributed
representations, neural networks have advanced sentiment classification
substantially. It is known that good word embeddings as inputs can improve
neural network models [9]; a simple and effective approach to learning
distributed representations was proposed by Mikolov [24], which introduced the
CBOW and Skip-gram models. GloVe is an unsupervised learning algorithm for
obtaining global context [25]. These distributed representations can greatly
enhance the ability of neural networks. Neural network models include
Recursive Neural Networks [26], Recurrent Neural Networks [27], and LSTM [5].
Among them, RNNs can handle sequences better because they take the context
information of sequences into account. Unfortunately, a problem with RNNs is
that during training, components of the gradient vector can grow or vanish
exponentially over long sequences [28]. Long short-term memory networks,
however, solve this biggest problem of RNNs, the vanishing gradient problem.
Variant models based on LSTM have been proposed to increase its ability, such
as a tree-structured LSTM for better semantic composition [19] and an extra
sparse matrix added to LSTMs to obtain better results [29]. Most of these models
work well on sentence-level and paragraph-level sentiment classification. When
it comes to document-level sentiment classification, the effect is not always
perfect.

Although it is widely accepted that LSTM has longer-lasting memory units than
RNNs, it will still forget information that is too far away from the current point.
This problem is more pronounced when we deal with document-level sentiment
classification. In order for LSTM to store longer-range information, various
LSTM-based models have been proposed. For example, [30] adds an external
memory to LSTM, but its time performance is poor because of the huge external
memory matrix. [31] uses a bidirectional LSTM with an attention model for
document-level sentiment classification, and [6] proposed CLSTM, which defines
a concept of forgetting rates and divides memory into several groups; different
forgetting rates, regarded as filters, are assigned to different groups.

We can conclude that these methods are built on a single-layer LSTM model and
modify the LSTM structure. In contrast, we propose an LSTM model with two
hidden layers. The first layer is trained to acquire sentence vectors. A simple
way to get sentence vectors is to ignore the order of sentences and average word
embeddings, but this fails to capture complex semantic relations between words;
one could also use a CNN to get sentence vectors [7]. We instead use the first
LSTM layer to achieve this, and the second layer uses the sentence vectors to
derive the overall sentiment polarity of the document as the document
composition. This overcomes the fact that a standard LSTM cannot store
information over very long ranges, and, through the two-layer LSTM structure,
also allows us to take more specific semantics into consideration.

3. The Proposed Method


In this section, we review LSTM and some of its variants, and then introduce our
method and an approach to improve it.

3.1. Long Short-term Memory Networks


The long short-term memory network (LSTM) [5] is a typical recurrent neural
network. It modifies the structure of the memory cell in the RNN by transforming
the tanh layer of the RNN into a structure containing a memory unit and a gate
mechanism, which decides how to utilize and update the information kept in the
memory cell. Because of this structure, it alleviates the problems of gradient
vanishing and explosion. Figure 1 shows the structure of a standard LSTM at
time step t. Here i, o, f denote the input gate, the output gate, and the forget
gate, and c is the memory cell. The inputs of the input gate it, the output gate ot,
and the forget gate ft are the vector xt that the network receives at time t and
the previous hidden state ht−1.
Figure 1: The structure of a standard LSTM unit, with input gate i, output gate o,
forget gate f, and memory cell c.

Formally, each LSTM component can be formalized as:

ft = σ(Wf xt + Uf ht−1 + bf )    (1)
it = σ(Wi xt + Ui ht−1 + bi )    (2)
ot = σ(Wo xt + Uo ht−1 + bo )    (3)
ĉt = tanh(Wc xt + Uc ht−1 + bc )    (4)
ct = ft ◦ ct−1 + it ◦ ĉt    (5)
ht = ot ◦ tanh(ct )    (6)

where σ is the logistic sigmoid function and ft, it, ot, ct are the forget gate, the
input gate, the output gate, and the memory cell activation vector at time-step t.
The entries of the gating vectors it, ft, ot are in [0, 1], and ◦ denotes element-wise
multiplication. The biases bf, bi, bo, bc ∈ RH have the same size as ht ∈ RH;
Wf, Wi, Wo, Wc ∈ RH×d and Uf, Ui, Uo, Uc ∈ RH×H, where H is the
dimensionality of the hidden layer and d is the dimensionality of the input.
AN
3.2. Some variants of LSTM

Since LSTM was put forward, it has received wide attention. In recent years,
LSTM has been used more and more widely in the field of natural language
processing, and many variants of LSTM have been proposed.

• PC-LSTM

One of the popular variants was proposed by Gers and Schmidhuber in
2000 [32]. It adds peephole connections to the memory cell so that every gate
can also accept the input information of the cell state. Compared with the
standard LSTM, in the PC-LSTM the forget gate, the input gate, and the output
gate can also access the full information in the memory cell at every time-step.
In this structure, three connections are added that connect the current memory
cell to the forget gate, the input gate, and the output gate.

• CIFG-LSTM

In order to simplify the structure of the LSTM, the CIFG-LSTM couples the input
gate and the forget gate into one uniform gate; coupling them avoids producing
redundant information [33]. A new value is added to the state exactly when
some old information is forgotten. We therefore define it = 1 − ft and use ft to
denote the coupled gate, and Eq. (5) is replaced as follows (a small code sketch
of this coupled gate is given after this list):

ct = ft ◦ ct−1 + (1 − ft ) ◦ ĉt    (7)
• GRU

Another popular variant is the GRU [34], proposed by Cho et al. in 2014. It
combines the forget gate and the input gate into a single update gate. The cell
state and the hidden state are also merged, so the GRU model is simpler than
the standard LSTM model. The GRU has only two gates, the reset gate and the
update gate; r and z jointly control how the new hidden state st is obtained from
the previous hidden state st−1, and the output gate of the LSTM is removed. If
the reset gate is 1 and the update gate is 0, the GRU degrades completely to a
plain RNN.

• Bi-LSTM
A bidirectional LSTM [21] consists of two LSTMs and utilizes additional
backward information, thus enhancing the memory capability. The basic idea of
the bidirectional LSTM is to run two LSTMs over the forward and backward
training sequences, both connected to the same output layer. This structure
provides complete past and future context information for each point in the
input sequence.

We also employ the bidirectional mechanism on SSR-LSTM, which utilizes
additional backward information and thus enhances the memory capability.
ity.

3.3. LSTM with sentence representations (SR-LSTM)


We propose an end-to-end LSTM model with two layers designed to obtain a
document representation for sentiment classification, one which can handle
documents of variable length. It captures a more delicate relationship between
the word embeddings and the sentence representations in a document. Figure 2
shows the structure of LSTM with sentence representations.

Figure 2: The LSTM with sentence representations model for document-level
sentiment classification (the word embeddings of each sentence feed a
word-level LSTM that produces sentence vectors; a sentence-level LSTM over
these vectors produces the document representation, which is fed to a softmax
classifier).

The idea comes from compositionality: a document consists of a list of sentences
and each sentence consists of a list of words; the meaning of a sentence comes
from the meanings of its words and the rules used to combine them, and the
representation of a document comes from the meanings of its sentences. We use
LSTMs with two layers. The input of the first layer is all the word
embeddings [9] in the whole document; in word embeddings, every word is
represented as a low-dimensional vector. All the word vectors are stacked in a
word embedding matrix Lw ∈ Rd×|V|, where |V| is the size of the vocabulary and
d is the dimension of the word embeddings. These word vectors can be
pre-trained from a text corpus with embedding learning algorithms such as
Word2vec [24] or GloVe [25]. In our model, we adopt GloVe to make better use
of the semantic and grammatical associations of words. Our model first produces
continuous sentence vectors from word representations. Because the current
time-step input of an RNN or LSTM contains both the output of the previous
time-step and the input of the current time-step, we can use the output of the
last time-step in each sentence to represent the sentence; this is the sentence
representation we put forward to represent sentence composition. All the
sentence representations are then the input of the second LSTM layer. After
computing the hidden vectors of the second layer, we regard the last hidden
vector as the document representation [35]. Document representations are then
used as features for document-level sentiment classification. We feed the
document representation to a linear layer whose output length is the number of
classes, and add a softmax layer to output the probability of classifying the
document as positive, negative, or neutral. The softmax function is calculated as
follows, where C is the number of sentiment categories:

softmax_i = exp(x_i) / Σ_{i'=1}^{C} exp(x_{i'})    (8)

Our neural network model can handle shorter sequences and reduce the rate at
which information is forgotten. At the same time, it not only considers the
semantic information between words, but also combines the semantic
information between sentences in the document, which encodes the relations
between sentences into the semantic meaning of the document.
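
To make the two-layer structure concrete, the following is a minimal PyTorch
sketch of the architecture described above; it is our own illustration, not the
authors' released code. The zero-padding, the hidden size of 120, and the
optional GloVe initialization follow settings mentioned in this paper, while the
class name, argument names, and vocabulary handling are assumptions of ours.

```python
import torch
import torch.nn as nn

class SRLSTM(nn.Module):
    """Word-level LSTM producing sentence vectors, then a sentence-level LSTM
    producing the document representation (sketch of SR-LSTM)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=120, num_classes=5,
                 pretrained=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        if pretrained is not None:                  # e.g. a GloVe matrix, if available
            self.embedding.weight.data.copy_(pretrained)
        self.word_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, docs):
        # docs: word ids of shape (batch, max_sentences, max_words), 0 = padding
        batch, max_sents, max_words = docs.shape
        words = self.embedding(docs.reshape(batch * max_sents, max_words))
        _, (h_word, _) = self.word_lstm(words)      # last hidden state = sentence vector
        sent_vecs = h_word[-1].reshape(batch, max_sents, -1)
        _, (h_sent, _) = self.sent_lstm(sent_vecs)  # last hidden state = document vector
        return self.classifier(h_sent[-1])          # logits; softmax (Eq. 8) in the loss
```

The returned logits are passed to a cross-entropy loss, which applies the softmax
of Eq. (8).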

3.4. A tool for sentence emotional polarity


Before introducing an approach to improve our model, let us discuss the
sentiment dictionary, which is a tool for estimating sentence emotional polarity.
A sentiment dictionary contains sentiment scores of words. Popular sentiment
dictionaries include GI, NTU, HowNet, and SentiWordNet, a lexical resource for
opinion mining. Sentiment dictionaries are generally constructed by expanding a
seed sentiment dictionary into a large-scale one [36]. Words in a sentiment
dictionary are divided into three categories: emotional words, degree words, and
negation words. Emotional words can be further divided into positive evaluation
words, positive emotion words, negative evaluation words, and negative emotion
words; degree words are divided into several levels, for example with "most"
defined as the highest level and "least" as the lowest. The effect of negation
words is to determine whether to reverse the polarity of words.

Figure 3 shows the data structure of a general sentiment dictionary. We assume
that degree words are divided into three categories; the number of categories of
degree words differs between sentiment dictionaries.

Figure 3: The data structure of a general sentiment dictionary (negation words,
degree words with most/medium/least levels, and emotional words with
positive/negative polarity).

First, we split sentences into words and phrases; we use an NLP pipeline to
achieve this and to tag the part of speech of each word. This method of
sentiment classification has the advantages of simple reasoning and a small
workload. Judging the polarity of a sentence is usually done by comparing the
number of emotional words in it: if there are more positive words than negative
words, the sentence can be considered positive, and vice versa. Another way is
to use the sum of the scores of all words or phrases in a sentence to represent
the emotional polarity of the sentence. In our model, we use the second
approach. Note that, unlike machine learning and neural network methods, this
method does not take the semantic relations between words or sentences into
account, so it is not appropriate for document-level sentiment classification and
cannot achieve good results there, but its performance on paragraph-level or
sentence-level sentiment analysis is still acceptable. In our model (LSTM with
sorted sentence representations), we use this method to obtain the polarity of
sentences in each document.
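
The following toy sketch illustrates the second (score-summing) approach; the
mini-lexicon, negation set, and adverb weights are hypothetical placeholders
standing in for a real dictionary such as SentiWordNet, and the whole helper is
ours, not the authors' implementation.

```python
# Hypothetical mini-lexicon: positive-minus-negative polarity per word.
LEXICON = {"good": 0.6, "great": 0.8, "bad": -0.7, "terrible": -0.9}
NEGATIONS = {"not", "never", "no"}
DEGREE = {"very": 1.5, "slightly": 0.5}   # adverb weights

def sentence_score(tokens):
    """Sum word polarity scores, flipping sign after a negation word and
    scaling by the preceding degree adverb."""
    score, sign, weight = 0.0, 1.0, 1.0
    for tok in tokens:
        t = tok.lower()
        if t in NEGATIONS:
            sign = -sign
        elif t in DEGREE:
            weight = DEGREE[t]
        elif t in LEXICON:
            score += sign * weight * LEXICON[t]
            sign, weight = 1.0, 1.0          # reset after an emotional word
    return score

print(sentence_score("The food was not very good".split()))   # negative score
```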


Figure 4: The structure of SSR-LSTM (a preprocessing step with a sentiment
dictionary selects Maxn sentences when a document has more than Maxn
sentences, or pads up to Maxn otherwise, before the data enters the SR-LSTM
model).
3.5. LSTM with sorted sentence representations (SSR-LSTM)

Theoretically, LSTM can handle sequences of any length and avoid gradient
vanishing. To get a better training result, we specify a maximum sequence
length [37] and truncate the excess; in our model, we define Maxn as the
maximum number of sentences in each document. When a document exceeds
the maximum number of sentences, we cut off the extra sentences to fit the
input. If the document has fewer than the maximum number of sentences, we
pad the remaining places with 0.

When we deal with a document whose number of sentences exceeds the
maximum, we would otherwise truncate one or a few sentences from the back. If
the subjective comments are at the end of the document, we may lose very
important sentiment information, so we propose LSTM with sorted sentence
representations to overcome this. We do not change the order of sentences; we
choose Maxn sentences according to their sentiment polarity in each document.
To judge the sentiment polarity of each sentence, we use a sentiment dictionary.
We use SentiWordNet 3.0 [38] to get the sentiment score of each word in a
sentence and use the NLP pipeline to divide the sentence into phrases. The
sentiment score of a phrase is obtained by averaging the sentiment values of all
synonyms of the same word, and if there is an adverb, the sentiment score of the
phrase is multiplied by a weight; finally, the scores of all the parts in a sentence
are added up as the sentence sentiment score. Positive values represent positive
sentiment, negative values represent negative sentiment, and we compare
absolute values to rank sentences by polarity strength. Figure 4 shows the
structure of SSR-LSTM: before entering the SR-LSTM neural network model,
there is an additional layer that processes the data with the sentiment dictionary.
If the number of sentences contained in a document is more than Maxn, we
select the Maxn sentences with the strongest polarity, and if the number is less
than Maxn, we pad with 0 vectors up to Maxn; then we input the processed data
into the SR-LSTM model.
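
A minimal sketch of this selection-and-padding step, as we read the description
above; `score_fn` is assumed to be a sentence polarity scorer such as the
dictionary-based one of Section 3.4, and each sentence is a list of tokens.

```python
def select_sentences(doc_sentences, maxn, score_fn):
    """Keep at most `maxn` sentences per document for SSR-LSTM input.

    If the document has more than maxn sentences, keep the maxn sentences with
    the strongest absolute polarity, preserving their original order; otherwise
    pad with empty sentences (later encoded as zero vectors).
    """
    if len(doc_sentences) > maxn:
        ranked = sorted(range(len(doc_sentences)),
                        key=lambda i: abs(score_fn(doc_sentences[i])),
                        reverse=True)[:maxn]
        return [doc_sentences[i] for i in sorted(ranked)]   # original order
    return doc_sentences + [[] for _ in range(maxn - len(doc_sentences))]
```
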
The entire model is trained end-to-end with stochastic gradient descent, where
the loss function is the cross-entropy error of supervised sentiment classification.
To avoid overfitting (overfitting means the model over-fits the training data,
including the noise, so that the training cost is minimal while the overall
regularity is ignored and the model cannot perform well on unknown data such
as the test data), we add an L2 regularization term over all parameters. L2
regularization limits the size of the weights so that the model cannot freely fit
the random noise in the training data. Let y be the target distribution for each
document and z be the predicted document distribution. The goal of training is
to minimize the cross-entropy error between y and z for all training documents:

loss = − Σ_i Σ_j y_i^j log z_i^j + (λ/2) ||θ||²    (9)

where i is the index of the document, j is the index of the class, λ is the L2
regularization weight, and θ is the parameter set.
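
A minimal PyTorch sketch of one training step with the objective of Eq. (9), i.e.
cross-entropy plus an L2 penalty; it assumes a model such as the SR-LSTM
sketch of Section 3.3 and a torch optimizer (e.g. Adagrad), and the value of
`l2_lambda` is a placeholder, not a setting reported in this paper.

```python
import torch.nn.functional as F

def training_step(model, docs, labels, optimizer, l2_lambda=1e-5):
    """One optimization step minimizing cross-entropy plus (lambda/2)*||theta||^2."""
    optimizer.zero_grad()
    logits = model(docs)                          # document-level class scores
    loss = F.cross_entropy(logits, labels)        # first term of Eq. (9)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    loss = loss + (l2_lambda / 2) * l2            # second term of Eq. (9)
    loss.backward()
    optimizer.step()
    return loss.item()
```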

4. Experiment

In this section, we study the results of our model on three real-world datasets
for document-level sentiment classification, and we compare the effect of
different word embeddings and different dimensions of word embeddings on our
proposed model.

Dataset     Train Size   Valid Size   Test Size   Ws/Doc   Sens/Doc   Class
Yelp 2014   183019       22745        25399       196.9    11.41      5
Yelp 2015   194360       23652        25341       151.9    8.97       5
IMDB        67426        8381         9112        394.6    16.08      10

Table 1: Statistical information of the IMDB and Yelp 2014/2015 datasets. Train
Size, Valid Size, and Test Size are the sizes of the training, validation, and test
sets. Ws/Doc and Sens/Doc are the average numbers of words and sentences per
document. Class is the number of classes.

Datasets    Hidden layer units   Maxn   Batch size
Yelp 2014   120                  11     64
Yelp 2015   120                  9      64
IMDB        160                  16     128

Table 2: Optimal hyper-parameter configuration for the three datasets.

4.1. Experimental Setting


For document-level sentiment classification, we evaluate our model on three
popular real-world datasets. IMDB is a large movie review dataset, while Yelp
2014 and Yelp 2015 are two restaurant review datasets. All three datasets are
publicly accessible. We use Stanford CoreNLP [39] to tokenize and split
sentences in these datasets. Table 1 shows the statistical information of the
three datasets. We split the three datasets into training, validation, and testing
sets with an 80/10/10 ratio. The training set is used to train the model; to avoid
overfitting, we do not tune parameters directly against the test set. We use the
validation set to determine the hyperparameters of the model and to evaluate its
behavior under different parameters. We use accuracy and MSE to evaluate our
models, where accuracy is a standard metric for overall sentiment classification
performance. MSE (Mean Squared Error) is a convenient way to measure the
average error; smaller MSE values indicate that the predictive model describes
the experimental data more accurately. The MSE is calculated as follows:

MSE = Σ_{j=1}^{N} (standard_j − predicted_j)² / N    (10)
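
For concreteness, a small helper (ours, not from the paper) showing how
accuracy and the MSE of Eq. (10) are computed from predicted and gold class
labels.

```python
import numpy as np

def evaluate(gold, pred):
    """Accuracy and MSE (Eq. 10) over integer class labels."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    acc = float((gold == pred).mean())
    mse = float(((gold - pred) ** 2).mean())
    return acc, mse

print(evaluate([5, 4, 3], [5, 3, 3]))   # (0.666..., 0.333...)
```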

For the configuration of parameters, we select the parameters that give our
models the best performance. We use publicly available 300-dimensional GloVe
vectors [25] as pre-trained word embeddings, and the dimension of the hidden
units is set to 120. We use Adagrad [40] as the optimizer with an initial learning
rate of 0.01. For IMDB, Yelp 2014, and Yelp 2015 we set the batch size to 128,
64, and 64, respectively. The number of hidden layer units is 120 for the three
datasets. Maxn represents the maximum number of sentences per document;
this parameter is selected based on the average number of sentences per
document in each dataset. For example, we set Maxn to 16 for IMDB because
the average number of sentences in each IMDB document is 16.08. Finally,
Maxn is chosen among (16, 11, 10), which gives the best parameters for the
three datasets. The configurations are shown in Table 2.

For model initialization, we randomly initialize all matrices by sampling from a
uniform distribution in [-0.1, 0.1] and update all parameters with stochastic
gradient descent.
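
A brief PyTorch sketch of this initialization and optimizer setup; the stand-in
model is only a placeholder (in practice it would be the SR-LSTM sketch of
Section 3.3), while the [-0.1, 0.1] uniform initialization and the Adagrad learning
rate of 0.01 follow the settings above.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the SR-LSTM model.
model = nn.LSTM(input_size=300, hidden_size=120, batch_first=True)

# Uniform initialization in [-0.1, 0.1] for all parameters, as described above.
for p in model.parameters():
    nn.init.uniform_(p, -0.1, 0.1)

# Adagrad with the initial learning rate of 0.01.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
```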

4.2. Baseline models


We compare our methods with the following baseline methods for
document-level sentiment classification. We divide our baseline models into
three categories. In the first class, we exploit machine learning algorithms to
build sentiment classifiers.

• Naive Bayesian

Naive Bayesian is a popular machine learning classification algorithm; we use
bags of words [11] as features.

• SVM

We also use bag-of-words features and train an SVM classifier with
LibLinear [41].

In the second class, we use recurrent neural networks to model long sequences
for document-level sentiment classification.

• RNN

RNN is a basic method for modeling sequential texts [27].

• LSTM

LSTM is a recurrent neural network with memory cells and a three-gate
mechanism (Hochreiter and Schmidhuber, 1997).

• PC-LSTM

Compared with the standard LSTM, PC-LSTM [32] adds peephole connections to
the memory cell so that every gate can also accept the input information of the
cell state.

• CIFG-LSTM

CIFG-LSTM [33] combines the input and forget gates of the LSTM and requires
a smaller number of parameters than LSTM.

• GRU

The GRU combines the forget gate and the input gate into a single update
gate [34]. The cell state and the hidden state are also merged, so the GRU model
is simpler than the standard LSTM model.

• 2-layer LSTM

In 2-layer LSTM architectures, the hidden state of an LSTM unit in the first layer
is used as input to the LSTM unit in the second layer at the same time step [19].
Here, the idea is to let the second layer capture longer-term dependencies of the
input sequence.


Model          IMDB          Yelp 2014     Yelp 2015
               Acc    MSE    Acc    MSE    Acc    MSE
NB             0.353  4.36   0.583  0.83   0.613  0.73
SVM            0.404  3.54   0.608  0.75   0.589  0.81
RNN            0.232  5.82   0.473  1.15   0.479  1.09
LSTM           0.398  3.21   0.610  0.57   0.617  0.55
PC-LSTM        0.402  3.23   0.612  0.55   0.615  0.56
CIFG-LSTM      0.395  3.17   0.607  0.59   0.610  0.57
GRU            0.405  3.15   0.609  0.55   0.611  0.56
2-layer LSTM   0.401  3.18   0.613  0.51   0.625  0.45
CLSTM          0.429  2.67   0.624  0.49   0.627  0.47
SR-LSTM        0.440  2.24   0.632  0.46   0.639  0.46
SSR-LSTM       0.443  2.25   0.639  0.45   0.638  0.44
Bi-LSTM        0.432  2.21   0.625  0.49   0.625  0.48
SSR-BiLSTM     0.463  2.13   0.651  0.41   0.653  0.40

Table 3: Results of our models against baseline models on IMDB, Yelp 2014, and
Yelp 2015. Acc and MSE are evaluation metrics: Acc means accuracy (higher is
better) and MSE means mean squared error (lower is better). Best results in
each group are in gold.
• CLSTM

CLSTM [6] aims at capturing long-range information with a cache mechanism
that divides memory into several groups; different forgetting rates, regarded as
filters, are assigned to different groups.

4.3. Results

We compare classification accuracy and MSE (mean squared error) with other
competitive models. The results are shown in Table 3, from which we have
several findings.

1. First, we compare the two machine learning methods, NB and SVM (Naive
Bayesian and Support Vector Machine). SVM performs better than NB on the
IMDB and Yelp 2014 datasets, while NB is better on the Yelp 2015 dataset; they
reach 0.404 and 0.608 accuracy, respectively. From Table 3, we can see that the
machine learning methods perform almost as well as LSTM, but they need a
large number of features as input. Designing effective features is fundamental
work, and the performance of a machine learning classifier depends heavily on
the choice of data representations and features, whereas neural network models
can automatically learn features from the characteristics of the data; this is the
reason they have become widely used for sentiment classification recently. From
our experiment we can conclude that LSTM and the machine learning classifiers
have almost the same performance, which is good news for us.

2. Among the recurrent neural networks selected as baseline models, RNN has
the worst performance in modeling long texts due to the vanishing gradient
problem. In comparison, LSTM, PC-LSTM, CIFG-LSTM, and GRU perform
better, which shows that an internal memory and the three-gate structure play a
key role in modeling long texts. The four LSTM models have almost the same
performance on the three datasets.

3. The proposed LSTM with sentence representations (SR-LSTM) and LSTM
with sorted sentence representations (SSR-LSTM) have the best performance on
the three datasets and beat CLSTM [6] and the 2-layer LSTM [19]. In particular,
on Yelp 2014 SSR-LSTM achieves 0.639 accuracy, which is 0.015 better than
CLSTM and 0.026 better than the 2-layer LSTM. On the IMDB and Yelp 2014
datasets, SSR-LSTM performs better than SR-LSTM; on the Yelp 2015 dataset,
however, SSR-LSTM has almost the same performance as SR-LSTM, which
suggests that subjective sentences rarely appear at the end of documents in the
Yelp 2015 dataset.

4. With the help of the bidirectional architecture, models can look forward and
backward to capture features when modeling long texts, so Bi-LSTM performs
better than single-directional models. Among the bidirectional models, our
model performs well, achieving 0.463, 0.651, and 0.653 accuracy on IMDB, Yelp
2014, and Yelp 2015, with 2.13, 0.41, and 0.40 MSE.

5. In terms of time complexity and the number of parameters, the two proposed
models (SR-LSTM and SSR-LSTM) and the 2-layer LSTM all have two hidden
layers, but our models require less computation and time than the 2-layer LSTM
while achieving higher accuracy. Compared with a fully connected connection
between the two layers, our models only use the output sentence vector of the
first layer as the input of the second layer, so they have fewer parameters and
lower computational cost.

4.4. The effect of Word Embeddings

We know that the input of neural networks is word embeddings [9]. It is well
accepted that good word embeddings are crucial to composing text
representations at a higher level. We therefore want to know the effect of
different word embeddings on our models. We choose IMDB as our
document-level dataset for sentiment classification. We compare randomly
initialized vectors, the two word2vec models (CBOW and Skip-gram) [42], and
GloVe [25] on three models: LSTM, SR-LSTM, and SSR-LSTM. All the word
vectors are 300-dimensional and learned from Twitter.
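
As a sketch of how such pre-trained vectors are typically loaded into the
embedding matrix Lw described in Section 3.3 (our own helper, not the authors'
code): `path` points to a GloVe text file with one word and its vector per line,
and `vocab` is assumed to map words to row indices; words missing from the
file keep a random initialization.

```python
import numpy as np

def load_glove(path, vocab, dim=300):
    """Build a |V| x dim embedding matrix from a GloVe text file."""
    emb = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in vocab and len(values) == dim:
                emb[vocab[word]] = np.asarray(values, dtype="float32")
    return emb
```
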
Next, we discuss the effect of word embeddings of different dimensions on our
model's performance and time cost.
            Glove.50d   Glove.100d   Glove.200d   Glove.300d
LSTM        0.358       0.377        0.391        0.398
SR-LSTM     0.405       0.422        0.436        0.440
SSR-LSTM    0.403       0.424        0.438        0.443

Table 4: Classification accuracy of LSTM, SR-LSTM, and SSR-LSTM with word
embeddings of different dimensions. We compare Glove.twitter.50d,
Glove.twitter.100d, Glove.twitter.200d, and Glove.twitter.300d.
Randomly initialized vectors mean that the model treats word embeddings the
same as other parameters: we randomly initialize the word embeddings and the
other parameters by sampling from a uniform distribution in [-0.1, 0.1] and
update all parameters with stochastic gradient descent. Distributed
representations, another name for word embeddings, were proposed by Hinton
in 1986, and many methods of learning distributed representations have since
been proposed [43]. CBOW and Skip-gram are the two models contained in
Word2vec [42]; both are trained with neural networks and take the semantic
relations of context information into account. GloVe is an unsupervised learning
algorithm for obtaining global context [25].

From Table 6, we can see that the two word2vec models (CBOW and Skip-gram)
and GloVe perform better than randomly initialized vectors, especially with
SSR-LSTM. This shows the importance of context information for word
embedding learning methods such as Word2vec and GloVe. In addition, GloVe
gives a slight increase in accuracy on all three models, which indicates the
importance of global context for estimating good word representations.

We also compare GloVe vectors of different dimensions (50/100/200/300).
Classification accuracy and time cost are given in Table 4 and Table 5,
respectively. We find that 200-dimensional word vectors perform better than
50-dimensional and 100-dimensional word vectors, while 300-dimensional word
embeddings do not show significant further improvement. Furthermore,
SR-LSTM and SSR-LSTM have similar time costs, and both cost more time than
LSTM because SR-LSTM and SSR-LSTM have more parameters; in return, they
achieve higher classification accuracy.

            50dms   100dms   200dms   300dms
LSTM        7.8     23.2     48.9     113.1
SR-LSTM     12.5    40.6     71.3     132.2
SSR-LSTM    11.4    37.2     72.0     124.2

Table 5: Time cost of each model with 50dms, 100dms, 200dms, and 300dms
GloVe vectors. Each value is the number of minutes it takes to train the model.

            Random   CBOW    Skip-gram   Glove
LSTM        0.343    0.382   0.375       0.398
SR-LSTM     0.398    0.432   0.438       0.440
SSR-LSTM    0.397    0.428   0.439       0.443

Table 6: Classification accuracy of LSTM, SR-LSTM, and SSR-LSTM with
different word embeddings. We compare randomly initialized vectors, CBOW,
Skip-gram, and GloVe vectors.

Figure 5: The classification accuracy (a) and time cost in minutes (b) of
SSR-LSTM on the IMDB dataset, as a function of Maxn (from 8 to 24).
4.5. Effectiveness on Maxn

For our proposed model SSR-LSTM, Maxn (the maximum number of sentences
per document) is a key parameter, so we compare the classification accuracy
and time cost for different selected values of Maxn on the IMDB dataset; the
model we use is SSR-LSTM. The classification accuracy and time cost are shown
in Figure 5. From the trends in the two panels, we find that as the selected Maxn
grows, the classification accuracy gradually improves; from Maxn of 10 to 16 the
increase in accuracy is obvious, but as Maxn approaches 24 the accuracy
increases only slowly, and it even declines between 22 and 24. The reason is
that when Maxn is too large, fewer and fewer documents contain more than
Maxn sentences, so the impact on accuracy becomes smaller and smaller. From
Figure 5, we can also see that the training time grows as Maxn increases, and it
grows faster and faster, because the neural network model needs to train more
parameters as Maxn increases, which lengthens the training time.

So, in our model SSR-LSTM, it is reasonable to set Maxn to the average number
of sentences per document on the three datasets, because this does not consume
too much time and still ensures high classification accuracy.

5. Conclusion

We introduce new neural network models (SR-LSTM and SSR-LSTM) for
document-level sentiment classification. SR-LSTM exploits a two-layer LSTM
model; the approach encodes the semantics of sentences and their relations in
the document representation. Since a document consists of many sentences and
each sentence is a list of words, our model models the document in two steps:
the first layer uses word embeddings to produce sentence vectors, and the
sentence vectors are treated as inputs of the second layer to obtain document
representations. SSR-LSTM is an approach to improve SR-LSTM that first
removes sentences with little emotional polarity from the datasets. Before data
is input to the SR-LSTM model, we clean the three datasets; we do not change
the order of sentences and choose a fixed number of sentences according to
their sentiment polarity in each document. For SSR-LSTM, such a sorted input
builds a better model. Both models are trained end-to-end with supervised
sentiment classification objectives. Empirical results on three datasets (IMDB,
Yelp 2014, Yelp 2015) show that our models outperform state-of-the-art models.

For future work, we want to find a better way to obtain sentence vectors. In our
current model, we simply model the document in a sequential way; one could
instead compose the document representation over discourse tree structures,
such as with a tree-structured LSTM [19]. We plan to pursue this direction.

way, such as tree-structured LSTM[19]. We are going to achieve it.

6. Acknowledgments
We thank Guozheng Rao for his constructive work and the fruitful discussions;
we would also like to thank the anonymous reviewers for their valuable
comments. This work is supported by the National Natural Science Foundation
of China (NSFC) under grants 61373165, 61373035, and 61672377.


References

[1] J. Z. Wang, J. F. Jia, X. Liu, W. D. Chen, Q. K. Xue, Recognizing contextual
polarity: An exploration of features for phrase-level sentiment analysis 35 (3)
(2009) 399–433.

[2] T. B. Pang, B. Pang, L. Lee, Thumbs up? Sentiment classification using
machine learning, Proceedings of EMNLP (2002) 79–86.

[3] P. Liu, X. Qiu, X. Huang, Recurrent neural network for text classification with
multi-task learning, in: International Joint Conference on Artificial Intelligence,
2016, pp. 2873–2879.

[4] B. O'Connor, R. Balasubramanyan, B. R. Routledge, N. A. Smith, From tweets
to polls: Linking text sentiment to public opinion time series, in: International
Conference on Weblogs and Social Media, ICWSM 2010, Washington, DC, USA,
May 2010.

[5] A. Graves, Long short-term memory, Neural Computation 9 (8) (1997) 1735.

[6] J. Xu, D. Chen, X. Qiu, X. Huang, Cached long short-term memory neural
networks for document-level sentiment classification (2016) 1660–1669.

[7] D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural
network for sentiment classification, in: Conference on Empirical Methods in
Natural Language Processing, 2015, pp. 1422–1432.

[8] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng,
C. Potts, Recursive deep models for semantic compositionality over a sentiment
treebank.

[9] Y. Bengio, H. Schwenk, J. S. Senécal, F. Morin, J. L. Gauvain, Neural
probabilistic language models, Journal of Machine Learning Research 3 (6)
(2003) 1137–1155.

[10] B. Pang, L. Lee, Seeing stars: exploiting class relationships for sentiment
categorization with respect to rating scales, in: Meeting on Association for
Computational Linguistics, 2005, pp. 115–124.

[11] B. Pang, L. Lee, Opinion Mining and Sentiment Analysis, Now Publishers
Inc., 2008.

[12] P. D. Turney, Thumbs up or thumbs down? Semantic orientation applied to
unsupervised classification of reviews, Proceedings of the Annual Meeting of the
Association for Computational Linguistics (2002) 417–424.

[13] A. B. Goldberg, X. Zhu, Seeing stars when there aren't many stars:
graph-based semi-supervised learning for sentiment categorization, in: The
Workshop on Graph Based Methods for Natural Language Processing, 2006,
pp. 45–52.

[14] S. Wang, C. D. Manning, Baselines and bigrams: simple, good sentiment and
topic classification, in: Meeting of the Association for Computational Linguistics:
Short Papers, 2012, pp. 90–94.

[15] R. Xia, C. Zong, Exploring the use of word relation features for sentiment
classification, 2 (2010) 1336–1344.

[16] G. Ifrim, G. Weikum, The bag-of-opinions method for review rating
prediction from sparse text patterns, in: COLING 2010, International Conference
on Computational Linguistics, Proceedings of the Conference, 23-27 August
2010, Beijing, China, 2010, pp. 913–921.

[17] P. W. Foltz, W. Kintsch, T. K. Landauer, The measurement of textual
coherence with latent semantic analysis.

[18] T. A. Mikolov, Statistical language models based on neural networks.

[19] K. S. Tai, R. Socher, C. D. Manning, Improved semantic representations from
tree-structured long short-term memory networks, Computer Science 5 (1)
(2015) 36.

[20] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly
learning to align and translate, Computer Science.

[21] A. Graves, N. Jaitly, A. R. Mohamed, Hybrid speech recognition with deep
bidirectional LSTM, in: Automatic Speech Recognition and Understanding, 2014,
pp. 273–278.

[22] H. Zhang, J. Li, Y. Ji, H. Yue, Understanding subtitles by character-level
sequence-to-sequence learning, IEEE Transactions on Industrial Informatics
13 (2) (2017) 616–624.

[23] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural
networks 4 (2014) 3104–3112.

[24] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed
representations of words and phrases and their compositionality, Advances in
Neural Information Processing Systems 26 (2013) 3111–3119.

[25] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word
representation, in: Conference on Empirical Methods in Natural Language
Processing, 2014, pp. 1532–1543.

[26] Q. Qian, B. Tian, M. Huang, Y. Liu, X. Zhu, X. Zhu, Learning tag embeddings
and tag-specific composition functions in recursive neural network, in: Meeting
of the Association for Computational Linguistics and the International Joint
Conference on Natural Language Processing, 2015, pp. 1365–1374.

[27] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, Recurrent
neural network based language model, in: INTERSPEECH 2010, Conference of
the International Speech Communication Association, Makuhari, Chiba, Japan,
September 2010, pp. 1045–1048.

[28] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with
gradient descent is difficult, IEEE Transactions on Neural Networks 5 (2) (2002)
157–166.

[29] P. Bhatia, Y. Ji, J. Eisenstein, Better document-level sentiment analysis from
RST discourse parsing, Computer Science.

[30] T. Ke, A. Bisazza, C. Monz, Recurrent memory networks for language
modeling.

[31] D. Tang, B. Qin, X. Feng, T. Liu, Effective LSTMs for target-dependent
sentiment classification, Computer Science.

[32] F. A. Gers, J. Schmidhuber, Recurrent nets that time and count, in:
IEEE-INNS-ENNS International Joint Conference on Neural Networks, 2000,
pp. 189–194, vol. 3.

[33] J. Chung, C. Gulcehre, K. H. Cho, Y. Bengio, Empirical evaluation of gated
recurrent neural networks on sequence modeling, Eprint Arxiv.

[34] K. Cho, B. V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, Y. Bengio, Learning phrase representations using RNN
encoder-decoder for statistical machine translation, Computer Science.

[35] D. Tang, B. Qin, T. Liu, Learning semantic representations of users and
products for document level sentiment classification, in: Meeting of the
Association for Computational Linguistics and the International Joint Conference
on Natural Language Processing, 2015, pp. 1014–1023.

[36] S. M. Kim, E. Hovy, Automatic detection of opinion bearing words and
sentences, Proceedings of IJCNLP.

[37] A. Karpathy, J. Johnson, L. Fei-Fei, Visualizing and understanding recurrent
networks.

[38] K. Denecke, Using SentiWordNet for multilingual sentiment analysis, in:
IEEE International Conference on Data Engineering Workshop, 2008,
pp. 507–512.

[39] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky,
The Stanford CoreNLP natural language processing toolkit, in: Meeting of the
Association for Computational Linguistics: System Demonstrations, 2014.

[40] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online
learning and stochastic optimization, Journal of Machine Learning Research
12 (7) (2011) 257–269.

[41] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, C. J. Lin, LIBLINEAR: A
library for large linear classification, Journal of Machine Learning Research 9 (9)
(2008) 1871–1874.

[42] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word
representations in vector space, Computer Science.

[43] A. Paccanaro, G. E. Hinton, Learning Distributed Representations of
Concepts Using Linear Relational Embedding, IEEE Educational Activities
Department, 2001.

Biography

Weihang Huang, School of Computer Science and Technology, Tianjin University,
Tianjin, China. Corresponding author. Mail: [email protected]

Guozheng Rao, Tianjin Key Laboratory of Cognitive Computing and Application,
Tianjin, China. Mail: [email protected]

Zhiyong Feng, Tianjin Key Laboratory of Cognitive Computing and Application,
Tianjin, China, and School of Computer Software, Tianjin University, Tianjin,
China.

Qiong Cong, Tianjin Key Laboratory of Cognitive Computing and Application,
Tianjin, China.
cation, Tianjin, China *Biography of the aut
M
ED
PT
CE
AC

29

You might also like