
Sentiment Analysis Classification for Rotten Tomatoes Phrases on Kaggle
Kevin Hung
[email protected]

ABSTRACT
In the second assignment for CSE 190: Data Mining and Predictive Analytics, we apply some techniques to improve the accuracy of classifying Rotten Tomatoes phrase sentiments.

General Terms
Algorithms, Experimentation

Keywords
Classification, Sentiment Analysis, Opinion Mining, Naïve Bayes, Binned, Regression

Table 1. Sentiment Label Coding

    Sentiment            Label
    Negative             0
    Somewhat Negative    1
    Neutral              2
    Somewhat Positive    3
    Positive             4
1. INTRODUCTION
Applying sentiment analysis to reviews based on text features is distinct from rating-scale inference problems such as predicting the rating value of a review (e.g. for movies, restaurants, etc.), because we can gain more detail and insight into the human component (e.g. opinions, emotions, feelings) than with numerical features. As Richard Hamming once famously stated, "The purpose of computing is insight, not numbers." One of the many applications of classifying the sentiment of Rotten Tomatoes phrases through automation and machine learning is to save the human effort of evaluating each phrase manually.

2. DATASET
The original Rotten Tomatoes sentences were gathered as described in Pang and Lee's (2005) [1] approach to sentiment classification using metric labeling, starting from 10,662 review snippets that were usually one sentence long. Socher et al. from the Stanford NLP group then refined the snippet data into a more fine-grained form of parsed phrases and used Amazon Mechanical Turk to outsource the manual task of annotating the sentiments of the phrases [2].

The version of the data we obtained from the Kaggle website [3] is a tab-delimited file containing around 156,060 training records, with only the phrase's original sentence id and the phrase itself as features and the sentiment value as the label (coded as in Table 1). The second file, used for testing, contains 66,292 records with only the sentence id and the phrase provided.
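As a concrete illustration, the two files can be loaded with pandas. This is a minimal sketch rather than the author's script; the file names train.tsv and test.tsv and the column names follow the Kaggle competition's data page [3]:

```python
import pandas as pd

# Tab-delimited files as distributed by the Kaggle competition [3].
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

# Expected columns: PhraseId, SentenceId, Phrase, and (train only) Sentiment.
print(train.shape)  # approximately (156060, 4)
print(test.shape)   # approximately (66292, 3)
```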
3. EXPLORATORY ANALYSIS
Analyzing the prior distribution of sentiment labels is important for developing an intuition and obtaining a reasonable sense of what kind of predictions our models should make, as described in the later sections.

Figure 1. Sentiment Label Distribution

The sentiment labels appear to be fairly symmetric and slightly peakier than the normal distribution. The most frequent label is neutral, which is the clear value for a baseline model to predict. Phrases that are themselves unigrams, bigrams, or trigrams comprise 42% of the training set, and the following boxplot describes the lengths of the phrases, showing 75% of them being 10 tokens or fewer and the rest mainly being 10-20 words in length:

Figure 2. Phrase Length Distribution
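The statistics behind Figures 1 and 2 can be recomputed directly from the training file. A minimal sketch, reusing the train DataFrame from the previous listing; whitespace tokenization is an assumption, since the paper does not state its tokenizer:

```python
# Prior distribution of sentiment labels (Figure 1).
print(train["Sentiment"].value_counts(normalize=True).sort_index())

# Phrase length in tokens (Figures 2 and 3).
lengths = train["Phrase"].str.split().str.len()
print(lengths.quantile([0.25, 0.50, 0.75]))  # 75th percentile near 10, per the text
print((lengths <= 5).mean())                 # share of short phrases, near one half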

The fact that nearly half of the phrases are 5 tokens or fewer in length makes it reasonable for us to base our features on unigram and bigram counts, so that we can design models that take a term frequency matrix as input.

Figure 3. Phrase Length Grouped by Sentiment

Finally, the third figure is the most informative in that it shows phrases with sentiment label 2 (neutral) having less variance in length, and that there is a threshold at 5 tokens where we can design our model to use binning: one model tackles the case where the phrase has 5 or fewer tokens, and another model handles phrases with more than 5 tokens. Also, if we use regression models, phrase length is a possible feature.

4. PREDICTION OBJECTIVE
After performing a preliminary exploration of the data, the task we are tackling is predicting the sentiment label given the phrase as the feature, using a supervised model from machine learning, either classification or regression.

To evaluate a model, we submit the list of labels that the model outputs for the test-set phrases to Kaggle. Kaggle then calculates the categorization accuracy, between 0 and 1: the number of correctly predicted labels divided by the total number of samples, or equivalently one minus the Hamming loss (the two coincide for single-label classification).

The main types of model we consider are those that deal with classification, since there are 5 possible discrete categories, but we can also try a regression model since the sentiment labels are ordinal and scaled; we then round the output of the regression to the closest integer. A model that uses clustering could also be a possibility, and it would have to aggregate the labels of the closest neighboring training points. Fully unsupervised models, however, would not be appropriate for helping us predict the sentiment class.

4.1 Features
The combination of features we use is a subset of the counts of each unigram/bigram together with the number of tokens in the phrase. The only pre-processing of the text is lowercasing the review. Also, we leave punctuation in, since some punctuation tokens have sentiment labels assigned in the training set. As described in our exploratory phase, based on Figures 2 and 3, we take unigram and bigram counts to be reasonable features to represent as a term frequency or term frequency-inverse document frequency matrix. Finally, given the difference in phrase-length variation (number of tokens) across sentiment categories, we can bin on a threshold: the output of one model handles phrases with 5 or fewer tokens, and the output of a second model handles phrases containing more than 5 tokens.
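A minimal scikit-learn sketch of this feature extraction. Lowercasing is the only normalization and punctuation is kept; the token_pattern below, which keeps any run of non-space characters, is our assumption rather than the paper's exact tokenizer:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Unigram and bigram counts as a term-frequency matrix.
count_vec = CountVectorizer(lowercase=True, ngram_range=(1, 2),
                            token_pattern=r"\S+")
X_counts = count_vec.fit_transform(train["Phrase"])

# The tf-idf variant mentioned above.
tfidf_vec = TfidfVectorizer(lowercase=True, ngram_range=(1, 2),
                            token_pattern=r"\S+")
X_tfidf = tfidf_vec.fit_transform(train["Phrase"])
```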
5. MODELS

5.1 Baseline
The most simplistic model we can use as a baseline, against which to compare more complex models, is to predict the majority sentiment category for every phrase; the most frequently appearing sentiment value is neutral (2). The score for this baseline model is 0.51789.
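A minimal sketch of the baseline, reusing the train and test DataFrames from Section 2 (hypothetical code, not the author's script):

```python
import numpy as np

# Predict the majority class, neutral = 2, for every test phrase.
majority = int(train["Sentiment"].mode()[0])
baseline_preds = np.full(len(test), majority)

# The 0.51789 score is computed by Kaggle on the hidden test labels;
# the in-sample frequency of the majority class gives a rough estimate.
print((train["Sentiment"] == majority).mean())
```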
5.2 Bag-of-words Multinomial Naïve Bayes
The next model is also very simplistic in the naïve sense of just counting the unigrams and representing them as a term-document matrix. The multinomial variation of NB can be described as

$$\hat{y} = \arg\max_{c \in \{0,\dots,4\}} P(c) \prod_{w} P(w \mid c)^{\mathrm{tf}(w)}$$

where $\mathrm{tf}(w)$ is the number of times unigram $w$ occurs in the phrase, the prior probability is

$$P(c) = \frac{N_c}{N}$$

for $N_c$ training phrases with label $c$ out of $N$ total, and the conditional probability is the add-one (Laplace) smoothed estimate

$$P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + |V|}$$

with $\mathrm{count}(w, c)$ the number of occurrences of $w$ in phrases labeled $c$ and $|V|$ the vocabulary size. The accuracy obtained improves to 0.58681.
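A minimal scikit-learn sketch of this model; MultinomialNB with alpha=1.0 applies exactly the Laplace smoothing written above (an assumption, since the report does not state its smoothing):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

unigram_vec = CountVectorizer(lowercase=True, token_pattern=r"\S+")
X_train = unigram_vec.fit_transform(train["Phrase"])
X_test = unigram_vec.transform(test["Phrase"])

nb = MultinomialNB(alpha=1.0)  # add-one (Laplace) smoothing
nb.fit(X_train, train["Sentiment"])
nb_preds = nb.predict(X_test)
```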
5.3 Linear Regression with Polar Words
The third model is an attempt to use linear regression with the unigrams appearing in the non-neutral training phrases, so that our model is of the form

$$\hat{y} = \mathrm{round}\!\left(\theta_0 + \sum_{w \in V_{\mathrm{polar}}} \theta_w \, x_w\right)$$

where $x_w$ is the count of polar unigram $w$ in the phrase and the output is rounded to the nearest label, as described in Section 4. The number of features is around 700, but the regression model sets the accuracy below the baseline, with a score of 0.50952.
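A sketch of this model under stated assumptions: "polar words" are read here as unigrams occurring in non-neutral training phrases, and since the full polar vocabulary would be much larger than the roughly 700 features the report mentions, the author presumably pruned it further by a rule not given in the text:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

# Vocabulary built from non-neutral (polar) training phrases only.
polar = train[train["Sentiment"] != 2]
polar_vec = CountVectorizer(lowercase=True, token_pattern=r"\S+")
polar_vec.fit(polar["Phrase"])

reg = LinearRegression()
reg.fit(polar_vec.transform(train["Phrase"]), train["Sentiment"])

# Round to the closest valid label, per Section 4.
raw = reg.predict(polar_vec.transform(test["Phrase"]))
reg_preds = np.clip(np.rint(raw), 0, 4).astype(int)
```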
5.4 Binned Multinomial Naïve Bayes
Having explored the length (number of tokens) of phrases grouped by sentiment category, we found that the mass of neutral phrases had 5 or fewer tokens, so we decided to use that as the threshold for our binned multinomial NB model. We trained two multinomial NB models, one on phrases with no more than 5 tokens and one on phrases with more, and predicted with the corresponding model. As a result, our model score increased to 0.60457.
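A sketch of the binned model, reusing the assumptions of the previous listings (whitespace token counts, unigram count features):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def fit_bin(frame):
    """Fit one vectorizer/NB pair on a subset of the training data."""
    vec = CountVectorizer(lowercase=True, token_pattern=r"\S+")
    model = MultinomialNB().fit(vec.fit_transform(frame["Phrase"]),
                                frame["Sentiment"])
    return vec, model

short = train["Phrase"].str.split().str.len() <= 5
vec_s, nb_s = fit_bin(train[short])
vec_l, nb_l = fit_bin(train[~short])

# Route each test phrase to the model matching its token count.
test_short = (test["Phrase"].str.split().str.len() <= 5).to_numpy()
preds = np.empty(len(test), dtype=int)
preds[test_short] = nb_s.predict(vec_s.transform(test.loc[test_short, "Phrase"]))
preds[~test_short] = nb_l.predict(vec_l.transform(test.loc[~test_short, "Phrase"]))
```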
5.5 Nearest Neighbor Based on Cosine Similarity of TF-IDF
The model that clusters similar phrases based on TF-IDF features did not have an adequate or reasonable computation time, but the decision function we developed is sketched below.
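A sketch of this decision function, under the assumption of tf-idf unigram vectors and a 1-nearest-neighbor lookup (the exact form is not fully specified in the text): predict the label of the training phrase whose tf-idf vector is most cosine-similar to the query. Scanning all 156,060 training phrases for every query is the cost that makes this approach impractically slow:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(lowercase=True, token_pattern=r"\S+")
X_train_tfidf = tfidf.fit_transform(train["Phrase"])

def predict_one(phrase):
    """1-nearest-neighbor by cosine similarity over tf-idf vectors."""
    sims = cosine_similarity(tfidf.transform([phrase]), X_train_tfidf)
    return int(train["Sentiment"].iloc[sims.argmax()])
```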
6. RELATED WORKS AND LITERATURE
The Rotten Tomatoes phrase data obtained for this study originated from Pang and Lee's work, in which they describe using item and label similarity for metric labeling with positive sentence percentage (PSP) and compare it against multi-class SVM (one-versus-all) and regression, finding that incorporating PSP helps improve average accuracies [1]. Also mentioned in the dataset section was Socher et al.'s work creating a sentiment treebank and labeling its nodes using recursive neural networks, obtaining an accuracy of nearly 80%, far past the baseline, which qualifies it as state-of-the-art [2]. In the following section we will see that their results and conclusions are far more accurate than our findings.

Another related dataset is the 50,000 positive and negative IMDb reviews described in Maas et al.'s work on a probabilistic model related to LDA that learns word vector representations and is able to capture sentiment and semantic similarities [4].

The techniques and theories described in the cited works were too advanced to incorporate into the models used in this study, but the features and models we have in common include Naïve Bayes, SVM, tf-idf matrix representations, and similarity measures. The following section on results and conclusions covers our use of Naïve Bayes and tf-idf matrix representations and an attempt at calculating similar phrases.

7. RESULTS AND CONCLUSIONS
The models we developed in this study do not perform as well as the state-of-the-art, or even close to the top scores of the Kaggle competition; some of the Kagglers implemented their own RNNs. The significant results and insights we gained are that Naïve Bayes again outperforms linear regression in both simplicity (i.e. there is no weight vector to fit, just counts of how many times each unigram appears) and accuracy, and that the binning threshold discovered in the exploratory section can help increase accuracy by 2%.

Table 3. Model Performance

    Model                              Score
    Binned Multinomial Naïve Bayes     0.60457
    Multinomial Naïve Bayes            0.58681
    Baseline                           0.51789
    Linear Regression                  0.50952
The feature representation that worked well is the term-document matrix, unlike the best-fitting line found by linear regression. An explanation for why linear regression performed worse than the baseline is that it carries a high bias: the misassumption that adding weights linearly over feature words represents the sentiment accurately. Because of this misassumption and the resulting inaccuracy, the parameters of the linear regression cannot be reliably interpreted as representing the sentiment of a phrase.

The models used in this study were not complex, and scaling was not an issue given the size of the training and testing sets. If there were more time and resources to conduct the study, then overfitting could be estimated using cross-validation.
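For instance, a cross-validated estimate for the bag-of-words NB model could be computed as follows (a sketch only; no such experiment was run in this study):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    CountVectorizer(lowercase=True, token_pattern=r"\S+"),
    MultinomialNB())

# 5-fold cross-validated accuracy on the training set estimates
# generalization without spending Kaggle submissions.
scores = cross_val_score(pipeline, train["Phrase"], train["Sentiment"],
                         cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```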
8. ACKNOWLEDGEMENTS
A deep token of appreciation to all members of the Data Science community at UCSD and the Computer Science and Engineering Department for offering a Data Mining course at the undergraduate level.

9. REFERENCES
[1] Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2005.
[2] Socher, Richard, et al. "Recursive deep models for semantic compositionality over a sentiment treebank." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Vol. 1631, 2013.
[3] https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data
[4] Maas, Andrew L., et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011.
