Applications of Deep Learning to Sentiment Analysis of Movie Reviews
Houshmand Shirani-Mehr
Department of Management Science & Engineering
Stanford University
[email protected]
Abstract
Sentiment analysis is one of the main challenges in natural language processing. Recently, deep learning applications have shown impressive results across different NLP tasks. In this work, I explore the performance of different deep learning architectures for sentiment analysis of movie reviews, using the Stanford Sentiment Treebank as the main dataset. Recurrent, Recursive, and Convolutional neural networks are implemented on the dataset and the results are compared to a baseline Naive Bayes classifier. Finally, the errors are analyzed and compared. This work can act as a survey on applications of deep learning to sentiment analysis.
1 Introduction
Sentiment analysis, or opinion mining, is the automated extraction of a writer's attitude from text [1], and is one of the major challenges in natural language processing. It has been a major point of focus for the scientific community, with over 7,000 articles written on the subject [2]. As an important part of the user interface, sentiment analysis engines are utilized across multiple social and review aggregation websites. However, the domain of applications for sentiment analysis reaches far beyond that. It provides insight for businesses, giving them immediate feedback on products and measuring the impact of their social marketing strategies [3]. In the same manner, it can be highly applicable in political campaigns, or any other platform that concerns public opinion. It even has applications to stock markets and algorithmic trading engines [4]-[5].
It should be noted that adequate sentiment analysis is not just about understanding the overall sentiment of a document or a single paragraph. For instance, in product reviews the author usually does not limit their view to a single aspect of the product. The most informative and valuable reviews are the ones that discuss different features and provide a comprehensive list of pros and cons. Therefore, it is important to be able to extract sentiments on a very granular level and relate each sentiment to the aspect it corresponds to. At a more advanced level, the analysis can go beyond only positive or negative attitude and identify complex attitude types.
Even on the level of understanding a single sentiment for the whole document, sentiment analysis is
not a straightforward task. Traditional approaches involve building a lexicon of words with positive
and negative polarities, and identifying the attitude of the author by comparing words in the text
with the lexicon [6]. In general, the baseline algorithm [7] consists of tokenization of the text,
feature extraction, and classification using different classifiers such as Naive Bayes, MaxEnt, or
SVM. The features used can be engineered, but mostly involve the polarity of the words according
to the gathered lexicon. Supervised [8] and semi-supervised [9] approaches for building high quality
lexicons have been explored in the literature.
However, traditional approaches fall short in the face of structural and cultural subtleties in written language. For instance, negating a highly positive phrase can completely reverse its sentiment, but unless we can efficiently represent the structure of the sentence in the feature set, we will not be able to
capture this effect. On a more abstract level, it will be quite challenging for a machine to understand sarcasm in a review. The classic approaches to sentiment analysis and natural language processing are heavily based on engineered features, but it is very difficult to hand-craft features that capture the properties mentioned above. And indeed, due to the dynamic nature of language, those features might become obsolete in a short span of time.
Recently, deep learning algorithms have shown impressive performance in natural language processing applications, including sentiment analysis, across multiple datasets [10]. These models do not need to be provided with pre-defined features hand-picked by an engineer; they can learn sophisticated features from the dataset by themselves. Although each single unit in these neural networks is fairly simple, by stacking layers of non-linear units on top of each other, these models are capable of learning highly sophisticated decision boundaries. Words are represented in a high-dimensional vector space, and the feature extraction is left to the neural network [11]. As a result, these models can map words with similar semantic and syntactic properties to nearby locations in their coordinate system, in a way which is reminiscent of understanding the meaning of words. Architectures like Recursive Neural Networks are also capable of efficiently representing the structure of sentences [12]. These characteristics make deep learning models a natural fit for a task like sentiment analysis.
In this work, I explore the performance of different deep learning architectures for sentiment analysis of movie reviews. First, a preliminary investigation of the dataset is done: statistical properties of the data are explored, a Naive Bayes baseline classifier is implemented on the dataset, and the performance of this classifier is studied. Then different deep learning architectures are applied to the dataset, and their performance and errors are analyzed; namely, deep dense networks with no particular structure, Recurrent Neural Networks, Recursive Neural Networks, and Convolutional Neural Networks are investigated. At the end, a novel approach is explored by using bagging and random forests for convolutional neural networks.
Dataset
The dataset used for this work is the Stanford Sentiment Treebank dataset [13], which contains 11,855 sentences extracted from movie reviews. These sentences contain 215,154 unique phrases and have fully labeled parse trees. The sentences are already parsed by the Stanford Parser, and the sentiment of each phrase on the tree is provided. The dataset has five classes for its labels, ranging from strongly negative to strongly positive, and a cross-validation split of 8,544 training examples, 1,101 validation samples, and 2,210 test cases is already provided with the data. Figure 1 shows a sample of this dataset.
2 Preliminary Analysis & Baseline Results
The first step in exploring the performance of different classifiers on a dataset is to identify an effective performance measure. In many cases, especially when the dataset is heavily biased towards one of the label classes, using accuracy is not the best way to measure performance. However, as shown in figure 2, the distribution of sample labels in the Stanford Sentiment Treebank (SST) dataset is not dominated by any single class. Additionally, predicting none of the classes carries a bigger weight compared to the others. The distribution of labels in the validation set shows the same structure. Therefore, accuracy can be used here as an effective measure to compare the results of different classifiers.
Figure 2: Distribution of sentiment labels in the SST dataset.
Although SST provides sentiments for the phrases in the dataset as well, and we are able to train our models using that information, sentiment analysis engines are usually evaluated on the whole sentence as a unit. Therefore, in this work the final performance is measured on sentences, which corresponds to the sentiment at the root of a tree in SST.
To have a baseline result for comparing how well the deep learning models perform, and to get a better understanding of the dataset, a Naive Bayes classifier is implemented on the data. The results of this classifier are shown in table 1. While the training accuracy is high, the test accuracy is around 40%. Figure 3 is a visualization of the confusion matrix of the classifier. The figure shows that the Naive Bayes classifier performs relatively well in separating positive and negative sentiments; however, it is not very successful in modeling the finer separation between "strong" and regular sentiment. Therefore, making the decision boundaries more complex seems like a viable option for improving the performance of the classifier. This option is explored in the following sections.
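For concreteness, the following is a minimal sketch of such a bag-of-words Naive Bayes baseline using scikit-learn; the tokenization, unigram features, and smoothing value are illustrative assumptions rather than the exact setup behind the reported numbers.

```python
# Bag-of-words Naive Bayes baseline (sketch; feature choices and smoothing are illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def naive_bayes_baseline(train_sentences, train_labels, test_sentences, test_labels):
    # train_sentences / test_sentences: lists of raw SST sentences (root level)
    # train_labels / test_labels: their 5-class sentiment labels
    vectorizer = CountVectorizer(lowercase=True)            # unigram count features
    X_train = vectorizer.fit_transform(train_sentences)
    X_test = vectorizer.transform(test_sentences)

    clf = MultinomialNB(alpha=1.0)                           # Laplace smoothing
    clf.fit(X_train, train_labels)
    return accuracy_score(test_labels, clf.predict(X_test))
```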
The simplest model to apply to the sentiment analysis problem in a deep learning setting is to use an average of word vectors trained by a word2vec model. This average can be perceived as a representation of the meaning of a sentence and can be used as the input to a classifier. However, this approach is not very different from the bag-of-words approach used in traditional algorithms, since it only considers single words and ignores the relations between the words in the sentence. Therefore, such a model cannot be expected to perform well. The results in [13] show that this intuition is indeed correct, and the performance of this model is fairly distant from state-of-the-art classifiers. Therefore, I skip this model and start my implementation with more complex ones.
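For reference, such an averaged sentence representation can be computed as in the sketch below; the embedding dimension and the handling of out-of-vocabulary words are assumptions.

```python
# Average-of-word-vectors sentence representation (sketch).
import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    """Average the word2vec vectors of the tokens; word_vectors maps word -> np.ndarray."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vecs:                      # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)      # fixed-size vector usable by any standard classifier
```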
The next natural choice is to use a deep dense neural network. As the input, the vectors of the words in the sentence are fed into the model. Various options, like averaging the word vectors or padding the sentences, were explored, yet none of them achieved satisfactory results. The models either did not converge or overfit the data with poor performance on the validation set. None of these models achieved accuracy higher than 35%. The intuition for these results is that while these models have many parameters, they do not effectively represent the structure of the sentence and the relations between words. While in theory they can represent very complex decision boundaries, their extracted features do not generalize well to the validation and test sets. This motivates using different classes of neural networks, networks whose architecture can represent the structure of the sentences in a more elegant way.
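As a rough illustration only (the exact architectures and hyperparameters tried are not given in the text, so the layer sizes below are assumptions), a dense network of this kind can be run on averaged sentence vectors with scikit-learn:

```python
# Deep dense (feed-forward) classifier on averaged word vectors (sketch; sizes are illustrative).
from sklearn.neural_network import MLPClassifier

def dense_baseline(X_train, y_train, X_val, y_val):
    # X_* are matrices of sentence vectors, e.g. the averaged word2vec vectors from above.
    model = MLPClassifier(hidden_layer_sizes=(200, 200), activation="relu",
                          early_stopping=True, max_iter=200)
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)   # validation accuracy
```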
4 Recurrent Neural Networks

Figure 5: Recurrent Neural Network: Learning Curve
Figure 6: Recurrent Neural Network: Effect of Word Vector Dimension
Figure 7: Recurrent Neural Network: Effect of Batch Size
Recurrent neural networks are not the most natural fit for representing sentences (Recursive neural networks, for instance, are a better fit for the task); however, it is beneficial to explore how well they perform for classifying sentiments. Figure 4 shows the structure of a vanilla recurrent neural network. The inputs are the successive word vectors from the sentence, and the outputs can be formulated as follows:

$$h^{(t)} = f\left(H h^{(t-1)} + L x^{(t)}\right)$$
$$y^{(t)} = \mathrm{softmax}\left(U h^{(t)}\right)$$

where $f$ is the non-linearity, which is initially the sigmoid function, and $y^{(t)}$ is the predicted probability for each class. One possible direction is to use $y$ at the last word in the sentence as the prediction for the whole sentence, since the effect of all the words has been applied to this prediction. However, this approach did not yield higher than 35% accuracy in my experiments.
Motivated by [14], I added a pooling layer between the softmax layer and the hidden layer, which increases the accuracy to 39.3% on the validation set. The pooling is done on the $h^{(t)}$ values, and mean pooling achieves almost 1% higher accuracy compared to max pooling. As a further improvement, an LSTM unit was used as the non-linearity in the network. With only this change, the performance does not improve, and the model overfits due to the larger number of parameters in the LSTM unit. However, by using Dropout [19] as a better regularization technique, the model is able to achieve 40.2% accuracy on the validation set and 40.3% accuracy on the test set. This accuracy is almost the same as the baseline model.

Figure 5 shows the learning curve for the recurrent neural network model, and figures 6 and 7 show the effect of changing different hyperparameters on the accuracy of the model.
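The exact wiring of the pooling layer is not spelled out in the text, but a plausible sketch of the mean-pooling variant described above, in which the hidden states are averaged before the softmax layer, is:

```python
# Mean pooling over hidden states before the softmax classifier (sketch).
import numpy as np

def predict_with_mean_pooling(hidden_states, U):
    # hidden_states: array of shape (T, k), one row per word position
    h_mean = hidden_states.mean(axis=0)       # average h(t) over the sentence
    scores = U @ h_mean                       # U: (num_classes, k)
    e = np.exp(scores - scores.max())         # numerically stable softmax
    return e / e.sum()                        # sentence-level class probabilities
```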
5 Recursive Neural Networks
Figure 9: Recursive Neural Network: Learning Curve
Figure 10: Recursive Neural Network: Effect of Word Vector Dimension
Figure 8 shows the structure of a recursive neural network. The structure of the network is based on the structure of the parse tree for the sentence. The vanilla model for this network can be formulated as follows:

$$h = f\left(W \begin{bmatrix} h_{\text{Left}} \\ h_{\text{Right}} \end{bmatrix} + b\right)$$
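Read off the composition rule above, a bottom-up pass over a binarized parse tree might look like this sketch (the tree encoding and the tanh non-linearity are assumptions):

```python
# Bottom-up composition over a binarized parse tree (sketch; tree encoding is an assumption).
import numpy as np

def compose(node, W, b, word_vectors):
    """node is either a token string (leaf) or a (left, right) pair of sub-trees."""
    if isinstance(node, str):                       # leaf: look up the word vector
        return word_vectors[node]
    h_left = compose(node[0], W, b, word_vectors)
    h_right = compose(node[1], W, b, word_vectors)
    children = np.concatenate([h_left, h_right])    # [h_Left; h_Right]
    return np.tanh(W @ children + b)                # h = f(W [h_Left; h_Right] + b)
```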
Since this model was already studied in detail in the assignments, and especially since Convolutional Neural Networks achieve higher accuracy, I did not experiment with Recursive neural networks at length. A single-layer model with a tanh non-linearity already performs well, while a two-layer variant overfits and calls for dropout regularization, and Recursive Neural Tensor Networks, with their larger number of parameters, do not converge. The learning curve and some experimentation with the hyperparameters of the model are shown in figures 9 and 10. The accuracy of the model is 42.2% on the test set, which is higher than both the recurrent neural networks and the baseline results.
6 Convolutional Neural Networks

In convolutional neural networks, a filter with a specific window size is run over the sentence, generating different results. These results are summarized using a pooling layer to generate one vector as the output of the filter layer. Different filters can be applied to generate different outputs, and these outputs can be used with a softmax layer to generate prediction probabilities. Figure 11 (from [20]) shows the structure of this network. The model can be described using the following equations:
$$c_i^{(j)} = f\left(W^{(j)} x_{i:i+h-1} + b^{(j)}\right)$$
$$c^{(j)} = \max\left(c_1^{(j)}, c_2^{(j)}, \dots, c_{n-h+1}^{(j)}\right)$$
$$y = \mathrm{softmax}\left(W^{(s)} c + b^{(s)}\right)$$
where $h$ is the length of the filter and $c$ collects the pooled outputs $c^{(j)}$ of the different filters. For this work, I have used the model proposed by Kim [20], which uses Dropout and a constraint on the size of the gradients to help the model converge better.
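A small numpy sketch of these equations, with one scalar feature per filter and max-over-time pooling (the shapes and the tanh non-linearity are illustrative assumptions):

```python
# Convolution with max-over-time pooling, following the equations above (sketch).
import numpy as np

def conv_max_pool(X, W, b):
    """X: (n, d) word vectors; W: (h, d) filter over a window of h words; b: scalar bias."""
    n, h = X.shape[0], W.shape[0]
    # c_i = f(W . x_{i:i+h-1} + b) at every window position i
    c = np.array([np.tanh(np.sum(W * X[i:i + h]) + b) for i in range(n - h + 1)])
    return c.max()                          # max-over-time pooling: one feature per filter

def cnn_predict(X, filters, W_s, b_s):
    """filters: list of (W, b) pairs; W_s, b_s: softmax layer parameters."""
    c = np.array([conv_max_pool(X, W, b) for W, b in filters])   # pooled feature vector
    scores = W_s @ c + b_s
    e = np.exp(scores - scores.max())
    return e / e.sum()                      # class probabilities
```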
Figure 12 shows the learning curve of the Convolutional neural network, and figure 13 shows that 50 is the locally optimal dimension for the word vectors used in the model.
Figure 12: Convolutional Neural Network: Learning Curve
Figure 13: Convolutional Neural Network: Effect of Word Vector Dimension
While we observe a slight improvement over Recurrent neural networks, the results are not significantly better than the baseline classifier. The significant gap between the training error and the test error shows that there is serious overfitting in the model. As a solution, instead of training the word vectors along with the other parameters using the samples, predefined 300-dimensional vectors from the word2vec model (available from https://code.google.com/p/word2vec/) are used and are kept fixed during the training phase. These vectors are trained on a huge dataset of news articles. The resulting model shows a significant improvement in accuracy. Figure 14 shows the learning curve for this model. The model trains very fast (the highest validation accuracy is at epoch 5), and the final accuracy on the test set is 46.4%.
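As an illustrative sketch of how such fixed vectors can be obtained (the loading library and file name below are assumptions; the text only states that pre-trained 300-dimensional word2vec vectors trained on news articles were used):

```python
# Loading pre-trained 300-dimensional word2vec vectors and keeping them fixed (sketch).
import numpy as np
from gensim.models import KeyedVectors

# Assumed file name: the binary archive distributed on the word2vec page.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed_sentence(tokens, dim=300):
    """Stack the fixed pre-trained vectors for a sentence; unknown words get zero vectors."""
    rows = [vectors[w] if w in vectors else np.zeros(dim) for w in tokens]
    return np.array(rows)   # shape (sentence_length, 300); never updated during training
```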
Table 1 shows the comparison of results for different approaches explored in this work.
Figure 14: Convolutional Neural Network with word vectors fixed from word2vec model: Learning
Curve
Recurrent neural networks are not an efficient model for representing the structural and contextual properties of a sentence, and their performance is close to that of the baseline Naive Bayes algorithm.
Recursive neural networks are built on the structure of the parse tree of a sentence, so they can capture the relations between words in the sentence more adequately. Additionally, they can use the phrase-level sentiment labels provided with the SST dataset for their training. Therefore, we expect Recursive networks to outperform Recurrent networks and the baseline results.
Convolutional neural networks can be seen as a generalized version of recursive neural networks. However, like recurrent neural networks, they have the disadvantage of losing the phrase-level labels as training data. On the other hand, using word vectors from the word2vec model results in a significant improvement in performance. This change can be attributed to the fact that, due to their large number of parameters, neural networks have a high potential for overfitting. Therefore, they require a large amount of data in order to find generalizable decision boundaries. Learning the word vectors along with the other parameters from the sentence-level labels in the SST dataset results in overfitting and degraded performance on the validation set. However, once we use pre-trained word2vec vectors to represent words and do not update them during training, the overfitting decreases and the performance improves.
Figures 15 and 16 show the confusion matrices of the two best models from the experiments. Compared to the confusion matrix for Naive Bayes, we can see that the correct predictions are distributed more evenly across the different classes. The Naive Bayes classifier is not as consistent as the deep learning models in predicting classes on a more granular level. As mentioned before, this is due to the capacity of deep neural networks to learn complex decision boundaries. While it is possible to engineer and add features in such a way that the performance of the Naive Bayes classifier improves, the deep learning model extracts features by itself and gains significantly higher performance.
Table 1: Comparison of results for the different models (accuracy, %).

Model                                       Training   Validation   Test
Naive Bayes                                   78.3        38.0      40.3
Recurrent Neural Network                      56.8        40.2      40.3
Recursive Neural Network                      54.0        38.6      42.2
Convolutional Neural Network                  72.7        41.1      40.5
Convolutional Neural Network + word2vec       88.2        44.1      46.4