Sentiment Analysis Classification For Rotten Tomatoes Phrases On Kaggle
Kevin Hung
[email protected]
2. DATASET
The original Rotten Tomatoes sentences were gathered as described in Pang and Lee's (2005) approach to sentiment categorization with respect to rating scales via metric labeling [1], which collected 10,662 review snippets, each usually a sentence long. Socher et al. from the Stanford NLP group then refined the snippet data into a more fine-grained form of parsed phrases and used Amazon Mechanical Turk to outsource the manual task of annotating the sentiment of each phrase [2].
For the version of the data we obtained from the Kaggle website [3], a tab-delimited training file contains around 156,060 records, with only the phrase's original sentence id and the phrase itself as features and the sentiment value as the label. The second file, used for testing, contains 66,292 records with only the sentence id and the phrase provided.
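As a rough illustration, both files can be loaded with pandas; the file and column names below (train.tsv, test.tsv, PhraseId, SentenceId, Phrase, Sentiment) are taken from the Kaggle competition page [3] and are assumptions of this sketch rather than details given in the text:

```python
import pandas as pd

# Assumed layout of the Kaggle files [3]:
#   train.tsv: PhraseId, SentenceId, Phrase, Sentiment
#   test.tsv:  PhraseId, SentenceId, Phrase
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

print(train.shape)                        # roughly (156060, 4)
print(train["Sentiment"].value_counts())  # label distribution; neutral dominates
```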
The sentiment labels appear to be very symmetric and slightly peakier than a normal distribution. The most frequent label is neutral, which is the clear baseline that our basic model should predict.

Unigrams, bigrams, and trigrams make up 42% of the phrases in the training set. The following boxplot describes the lengths of the phrases, showing that 75% of them are 10 words or fewer and that the rest are mostly 10 to 20 words long:
Figure 2. Phrase Length Distribution

The score used to evaluate the models is the number of correctly predicted labels divided by the total number of samples, or equivalently the distance between 1 and the Hamming loss:

$\text{score} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}(\hat{y}_i = y_i) = 1 - \text{HammingLoss}$

where $N$ is the number of samples, $y_i$ is the true sentiment label of phrase $i$, and $\hat{y}_i$ is the predicted label.
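This score can be checked with scikit-learn; a minimal sketch, assuming the five-point 0-4 sentiment scale used by the competition:

```python
from sklearn.metrics import accuracy_score, hamming_loss

# Toy labels on the assumed 0-4 sentiment scale.
y_true = [2, 3, 1, 2, 4, 0]
y_pred = [2, 3, 2, 2, 4, 1]

score = accuracy_score(y_true, y_pred)
# For single-label multiclass predictions, accuracy equals 1 - Hamming loss.
assert abs(score - (1 - hamming_loss(y_true, y_pred))) < 1e-12
print(score)  # 0.666...
```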
As a result of the binning, our model score increased to 0.60457.

5.5 Nearest Neighbor based on Cosine Similarity of TF-IDF

The model that used clustering of similar phrases based on TF-IDF features did not run in a reasonable amount of computation time, but the decision function developed is sketched below:
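A minimal reconstruction of that decision function as a 1-nearest-neighbor lookup over TF-IDF vectors, assuming scikit-learn's TfidfVectorizer and cosine_similarity (the exact implementation in the study may have differed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def nearest_neighbor_predict(train_phrases, train_labels, test_phrases):
    # Build TF-IDF vectors over the training phrases and project the test phrases.
    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_phrases)
    test_vectors = vectorizer.transform(test_phrases)

    # Predict, for each test phrase, the label of the most cosine-similar
    # training phrase. The full similarity matrix (~66k x ~156k here) is what
    # makes this approach prohibitively expensive.
    similarities = cosine_similarity(test_vectors, train_vectors)
    nearest = similarities.argmax(axis=1)
    return [train_labels[i] for i in nearest]
```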
Table 3. Model Performance

Model                            Score
Binned Multinomial Naïve Bayes   0.60457
Multinomial Naïve Bayes          0.58681
Baseline                         0.51789
Linear Regression                0.50952

The significant result and insight we gained in this study is that Naïve Bayes again outperforms linear regression in both simplicity (there is no need to calculate feature weights; we just count the number of times each unigram appears) and accuracy. Another significant result is that the binning threshold discovered in the exploratory section can help increase accuracy by 2%.
The feature representation that worked well is the term-document matrix, unlike the best-fitting line found by linear regression. An explanation of why linear regression performed worse than the baseline is its high bias: the misassumption that adding weights linearly over feature words represents the sentiment accurately. Because of this misassumption and the resulting inaccuracy, the parameters of the linear regression model cannot be interpreted as reliably representing the sentiment of a phrase.

The models used in this study were not complex, and scaling was not an issue given the size of the training and testing sets. If there were more time and resources to conduct the study, then overfitting could be estimated using cross-validation.
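Such an estimate could be obtained with k-fold cross-validation; a minimal sketch, assuming scikit-learn and the Naïve Bayes pipeline defined above (cross-validation was not actually performed in this study):

```python
from sklearn.model_selection import cross_val_score

# `model` and `train` are the pipeline and DataFrame from the earlier sketch.
# The gap between these held-out scores and the training accuracy would
# indicate how much the model overfits.
scores = cross_val_score(model, train["Phrase"], train["Sentiment"], cv=5)
print(scores.mean(), scores.std())
```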
8. Acknowledgements

A deep token of appreciation goes to all members of the Data Science community at UCSD and to the Computer Science and Engineering Department for the opportunity to offer a Data Mining course at the undergraduate level.

9. REFERENCES

[1] Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

[2] Socher, Richard, et al. "Recursive deep models for semantic compositionality over a sentiment treebank." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Vol. 1631. 2013.

[3] https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

[4] Maas, Andrew L., et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011.