Analyzing Sentiment Using IMDb Dataset

The document discusses performing sentiment analysis on movie reviews from the IMDb dataset using machine learning algorithms like Naive Bayes, Logistic Regression, Random Forest, and Decision Tree. It compares the results of these algorithms based on various evaluation metrics and analyzes the process of text preprocessing, feature extraction using bag-of-words, model training, and model evaluation.

12th International Conference on Computational Intelligence and Communication Networks

Analyzing Sentiment using IMDb Dataset


Sandesh Tripathi, Ritu Mehrotra, Vidushi Bansal, Shweta Upadhyay
Department of Computer Science
Birla Institute of Applied Sciences, Bhimtal, India

ABSTRACT- Text is the largest repository of human knowledge acquired over thousands of years. This knowledge yields even more value when it is mined for deeper insights. Sentiment Analysis (SA) offers a traditional machine learning (ML) solution to this problem by putting Natural Language Processing (NLP) to work. In the proposed work, we perform SA on the IMDb movie reviews dataset taken from Kaggle’s Bag of Words Meets Bags of Popcorn challenge to demonstrate how valuable insights can be drawn from a bulk of textual data collected from the internet. We derive these insights by applying four traditional ML algorithms, namely Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), and Decision Tree (DT). Furthermore, the results of these four algorithms are compared on the basis of six evaluation metrics: confusion matrix, accuracy, precision, recall, F1 measure, and Area Under Curve (AUC).
Keywords- Sentiment Analysis; IMDb reviews dataset; Bag of Words; Logistic Regression; Naive Bayes; Decision Tree; Random Forest.

I. INTRODUCTION
The main purpose of sentiment analysis (SA) is to detect the polarity of a given statement. The polarity can be positive, negative, or neutral. Determining it requires understanding the meaning of the statement and what it implies, which in turn involves text processing and analysis. While a person reaches such a conclusion easily, a machine must arrive at it through a series of smaller steps. When the number of tweets grows and the documents to be analysed are large, it becomes impossible for a single person to evaluate the sentiment of all the texts. We therefore rely on machines to process the data in a short period of time and generate the results, since they can learn from and process data at a far higher rate than a human reader.

Knowledge of people's sentiments is very important for business purposes, as it reveals the needs and requirements of customers, on the basis of which a business can produce good-quality goods. Customer feedback is equally important, as it provides valuable insight into the tendency to like or dislike a product. In this way organisations can meet the demand for the product. It also helps a business track the performance of its product in the market and check whether customers are satisfied with the quality and pricing plans.

Text processing can be very difficult, as it involves analysing streams of review data, which takes a lot of time and effort when done by people. The data can be structured, semi-structured, or unstructured, depending on the repository and platform. Machines can analyse this dispersed data by capturing the tag words that relate to the actual meaning of the entire sentence. In this way, tokens describing the polarity of the statement can be registered. Such a system can perform real-time analysis and deliver results efficiently. Sentiment analysis can be performed with rule-based, automatic, and hybrid systems, which rely on manually crafted rules, machine learning techniques, or both.

The first step of the procedure is to process the text into a standard form so that the token words become easy to access. Several text-processing methods eliminate unnecessary or meaningless words and create tokens. Tokens carry the meaning of the statements, so by referring to them one can easily understand the implication of a review. Token generation must be done very carefully, however, as errors in this first and most important step can lead to wrong conclusions.

The next step is the classification of statements using different algorithms such as Decision Trees, Naive Bayes, Logistic Regression, Random Forest, and k-nearest neighbour. The aim of classification is to predict the class of given data points, using the training data to learn how the input variables relate to the class. There are two types of learners: lazy learners and eager learners. Lazy learners store the training data and perform classification only when the testing data is loaded into the system. Eager learners construct a classification model from the training data so that the testing data is easy to classify once it appears.

Once the features are extracted, the text is fed to the model for training. This can be done using the “Bag of Words” approach, which associates the features with their values in the different reviews. Bag of Words converts the dataset into a matrix in which rows correspond to the reviews and columns correspond to the tokens extracted from the reviews.
After the cells of the matrix are filled with values, the quality of the machine learning algorithm is tested on the basis of:


978-1-7281-9393-9/20/$31.00 ©2020 IEEE
DOI: 10.1109/CICN.2020.0

Authorized licensed use limited to: Auckland University of Technology. Downloaded on December 18,2020 at 15:44:18 UTC from IEEE Xplore. Restrictions apply.
• Accuracy
• F-measure
• Area under curve (AUC)
• Recall
• Precision
The major role of evaluation metrics is to determine the efficiency of the algorithm and to confirm that the model is working as intended.
The paper is structured as follows: Section II describes the related work in the area of SA. Section III shows the proposed approach in a pointwise manner. Section IV includes the methods and techniques deployed. Section V presents the results of the proposed approach, and Section VI concludes the research and discusses future work.

II. RELATED WORK
The authors of [1] applied eight classifiers to the IMDb movie reviews dataset: Naive Bayes, Decision Tree, Random Forest, Ripple Rule Learning, K-Nearest Neighbours, Support Vector Classifier, Bayes Net, and Stochastic Gradient Descent. In their work Ripple Rule Learning gave the worst results, whereas Random Forest outperformed the other classifiers. The performance of these eight classifiers was measured by five evaluation metrics, namely Accuracy, Area Under Curve (AUC), F1-measure, Recall, and Precision.
The authors of [2] held that reviews of movies shared on social media platforms and other web portals are important factors in a movie’s financial success. Their results showed that positive sentiment is more effective for a movie domain with a small number of existing reviews, which indicated that sentiment is not the only factor. Rather, sentiment could perform better in combination with other factors such as movie genre and festive season.
In [3] the authors considered four classifiers, Maximum Entropy (ME), Naive Bayes (NB), Support Vector Machine (SVM), and Stochastic Gradient Descent (SGD), for SA on the IMDb dataset. Precision, recall, F-measure, and accuracy were used for performance comparison. The highlight of the work was that a longer “n-gram” got better results than a shorter one with all the classifiers.
Reference [4] focused on increasing the performance of the Naive Bayes classifier. The authors combined negation handling, feature selection, and word n-gram techniques to improve the performance. However, their best result of 88.8% is surpassed by the proposed approach (Section V).
In [5] a detailed study of evaluation metrics is given. Using multiple evaluation metrics is crucial, as an algorithm may perform well with respect to a single metric but give weaker results when weighed against another.
The authors of [6] aimed at performing SA on a multilingual system. For this purpose, they used lexical resources in the English SentiWordNet and considered Amazon’s German movie reviews dataset. Their approach got good results for a multilingual system.
Reference [7], unlike the other citations, worked with PDF, HTML, and XML files, among others. It lists a number of methods for sentiment analysis in such systems.
In [8], a detailed study was made of how Conditional Random Field (CRF) and LR can be used to check the polarity of a collection of texts. The authors deployed their approach in the SemEval 2015 ABSA task.
Reference [9] covers a variety of ML approaches that can be considered for an SA task. By describing them in a single work, the researchers provide an effective way to decide which algorithm to use in which case.
Indian languages are also used by a massive number of internet users. A study of such text material was made in [10]. The text was taken from Twitter, and the tweets were divided into positive, negative, and neutral ones.

III. PROPOSED APPROACH
The work gives a Sentiment Analysis model for the classification of movie reviews taken from the IMDb dataset. A pointwise summary of the proposed approach follows.
1. Retrieval of the IMDb dataset from the Kaggle Bag of Words Meets Bags of Popcorn challenge.
2. Pre-processing of the data, including cleaning, HTML tag removal, stopword removal, etc.
3. Feature selection from the possible sets of features.
4. Text representation using BoW.
5. Feeding the represented text to different classifiers.
6. Evaluation as per six different metrics.
7. Comparison between all the classifiers deployed.
The following flowchart summarizes the approach being presented.

Fig. 1. Methodology Flowchart
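
Steps 2 and 4 of the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `preprocess` and `count_dtm`, the tiny stopword set, and the rule that collapses repeated letters are assumptions made for illustration, and only single words (1-grams) are used as columns. The paper relies on the Natural Language Toolkit stopword list, which is replaced here by a hard-coded set so that the example needs only the Python standard library.

```python
import html
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the paper uses NLTK's list.
STOPWORDS = {"is", "am", "are", "the", "a", "an", "and", "of", "this", "was"}


def preprocess(review):
    """Return the cleaned tokens of one raw review."""
    text = re.sub(r"<[^>]+>", " ", review)      # remove HTML tags such as <br />
    text = html.unescape(text).lower()          # decode entities, lowercase
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # crude normalization: "goooood" -> "good"
    tokens = re.findall(r"[a-z']+", text)       # split into word tokens
    return [t for t in tokens if t not in STOPWORDS]


def count_dtm(reviews):
    """Build a count-based Document Term Matrix (rows: reviews, columns: words)."""
    docs = [Counter(preprocess(r)) for r in reviews]
    vocab = sorted(set().union(*docs))
    matrix = [[doc[term] for term in vocab] for doc in docs]
    return vocab, matrix


vocab, matrix = count_dtm([
    "This movie is goooood!<br />Good acting.",
    "The plot was bad",
])
print(vocab)   # ['acting', 'bad', 'good', 'movie', 'plot']
print(matrix)  # [[1, 0, 2, 1, 0], [0, 1, 0, 0, 1]]
```

The repeated-letter rule only handles elongated spellings; variants such as “gud” or “gd” would need a lookup table or a spelling model, which this sketch does not attempt.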

IV. METHOD DESCRIPTION
The methods used in the work are described in detail in this section.

A. Text pre-processing
Text gathered in the dataset is not ready to be fed to the classifier model as it is. This is primarily because classifiers need data in a prescribed form and not in the form of paragraphs. Also, the data has been collected mainly from internet sources and thus contains many HTML tags, abbreviations, etc.
The following techniques were used to pre-process the data:
• Removal of HTML tags - As stated before, the data has been taken primarily from the internet. It therefore contains HTML markup that needs to be removed first, as it gives no insight and, if it were turned into vectors, would consume memory needlessly.
• Text Normalization - Movie reviews are generally a form of casual writing. For example, “good” may at times be written as “goooood”, “gud”, “gd”, etc. All of these words carry a similar meaning but appear in different forms. They need to be converted to a base form called the canonical form. Additionally, a word may exist in different forms, such as tenses, and thus needs to be made fit for our purpose.
• Stopwords Removal - Common words such as “is”, “am”, “are”, “the”, etc. are likely to add no meaning to the text. They only support the words that carry the main meaning. Such words are called stopwords. The Natural Language Toolkit has a collection of such words for various languages. These can simply be removed.

B. Feature Extraction
• The dataset has a huge amount of text. Taken in raw form, it would yield an enormous number of features, so feature extraction had to be used. Otherwise, our ML model would have had overfitting issues.
• Feature extraction is alternatively called dimensionality reduction, as we are limiting the number of dimensions that will represent our dataset. A feature can be ignored only when it does not add any specific meaning to the data.

C. Text representation
After pre-processing and feature extraction, the aim is to transform the data into a form that can be understood by the classifier model. The models that were chosen required the text to be represented in some mathematical structure. Bag of Words (BoW) was chosen.

Bag Of Words
In BoW the complete dataset is converted into a matrix, called the Document Term Matrix (DTM). The rows of the matrix correspond to the reviews, whereas the columns represent the words that form these reviews. More precisely, these words are n-grams, where an n-gram is a phrase of “n” words.
The values of the DTM cells can be filled in multiple ways, of which two were explored:
• Count - The cell contains the actual count of occurrences of the word in the corresponding review.
• Term Frequency-Inverse Document Frequency (TF-IDF) - A statistical measure of the importance of a word to a specific document (a review in this case).

D. Classifiers
To classify the testing data into categories, four classifiers were chosen:
• Logistic Regression - Fits to the dataset a hypothesis in the form of the mathematical sigmoid function. It outputs the probability that a review falls in the positive category.
• Naive Bayes - Depends on conditional probability, where the data to be labelled is thought of as a set of conditions that have occurred and the task is to tell the probability of a certain category.
• Decision Tree - A classifier model that assigns labels based on a tree structure, where tree branches represent conditions on features and tree leaves represent the label. [1]
• Random Forest - An extension of the Decision Tree. Multiple decision trees are built, each with randomly selected data and features, under the condition that the trees have as little correlation with one another as possible. The final label is decided by combining the predictions of all the trees.

V. RESULTS
This section summarizes the results achieved.

TABLE I. RESULTS WITH COUNT

ML Algorithm / Metrics     Accuracy  Precision  Recall  F-score  AUC
Logistic Regression        0.8728    0.8708     0.8777  0.8742   0.94
Naive Bayes                0.8594    0.8566     0.8658  0.8612   -
Decision Tree Classifier   0.7134    0.7218     0.7014  0.7114   0.71
Random Forest Classifier   0.8584    0.862      0.8558  0.8589   0.93

Table I shows the results achieved with “count” as the DTM cell value. Logistic Regression is the clear winner in AUC. Decision Tree does not work as well as the other chosen classifiers.
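
To make the reported metrics concrete, the following sketch computes accuracy, precision, recall, and F-score from the four cells of a binary confusion matrix. The function names and the example counts are hypothetical and are not taken from the paper; AUC is left out because it needs the classifier's predicted scores rather than the confusion matrix alone. The last lines check that the F-score reported for Logistic Regression in Table I agrees with the harmonic mean of its reported precision and recall.

```python
def f_score(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
    }


# Hypothetical counts, for illustration only (not results from the paper).
example = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(example["accuracy"], example["precision"])  # 0.85 0.8

# Consistency check with Table I (Logistic Regression, count features).
print(round(f_score(0.8708, 0.8777), 4))  # 0.8742
```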

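
The TF-IDF cell values named in Section IV can be illustrated as follows. The paper does not state which TF-IDF variant was used, so this sketch assumes one common textbook form, tf(t, d) × log(N / df(t)), where tf is the relative frequency of term t in review d, N is the number of reviews, and df(t) is the number of reviews containing t. Library implementations often add smoothing and normalization, so their numbers will differ. The function name `tfidf_dtm` is hypothetical, and the input is assumed to be already pre-processed into token lists.

```python
import math
from collections import Counter


def tfidf_dtm(tokenized_reviews):
    """Build a TF-IDF Document Term Matrix from lists of tokens."""
    n_docs = len(tokenized_reviews)
    counts = [Counter(tokens) for tokens in tokenized_reviews]
    vocab = sorted(set().union(*counts))
    # Document frequency: number of reviews in which each term occurs.
    df = {term: sum(1 for c in counts if term in c) for term in vocab}
    matrix = []
    for c in counts:
        total = sum(c.values())  # number of tokens in this review
        row = [
            (c[term] / total) * math.log(n_docs / df[term]) if total else 0.0
            for term in vocab
        ]
        matrix.append(row)
    return vocab, matrix


vocab, matrix = tfidf_dtm([["good", "movie"], ["bad", "movie"]])
print(vocab)      # ['bad', 'good', 'movie']
print(matrix[0])  # "movie" occurs in every review, so its weight is 0.0
```

Under this variant a word that appears in every review receives zero weight, which is why TF-IDF down-weights uninformative terms that a plain count would treat as important.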
TABLE II. RESULTS WITH TF-IDF

ML Algorithm / Metrics     Accuracy  Precision  Recall  F-score  AUC
Logistic Regression        0.8914    0.882      0.9055  0.8936   0.96
Naive Bayes                0.8228    0.8285     0.8174  0.823    0.85
Decision Tree Classifier   0.7066    0.7098     0.7098  0.7114   0.71
Random Forest Classifier   0.8562    0.8597     0.8539  0.8568   0.93

Table II shows the results after TF-IDF was chosen to fill the DTM values. Logistic Regression again wins, with a highly competitive AUC of 0.96.
The best results were thus achieved when TF-IDF values were filled in the DTM and the classifier chosen was Logistic Regression. The graph below plots this case.

Fig. 2. AUC Curve for Logistic Regression

VI. CONCLUSION AND FUTURE WORK
In our work, we have presented an approach for the classification of sentiments using the IMDb dataset. The dataset was pre-processed to make it suitable to be fed to the classifier model. The Bag of Words approach was chosen for text representation. Finally, four different traditional machine learning algorithms were deployed to obtain results, which were then compared on the basis of different evaluation metrics. TF-IDF with Logistic Regression gave the best validation AUC, nearly 0.96, for our task.
In future, we can use state-of-the-art word embeddings like Word2Vec, which overcome some of the limitations of the TF-IDF based approach. Word2Vec captures the semantic similarity between words, and words with similar meanings are placed in close proximity in the vector space. However, it still suffers from a fundamental problem called Out of Vocabulary (OOV): it is unable to provide an embedding for a word that is not in the training corpus. To deal with this we can use even more sophisticated word embeddings like Facebook’s FastText or BERT. Training these models will also require more advanced hardware such as GPUs and a higher degree of parallelization.

REFERENCES
[1] Yasen, Mais, and Sara Tedmori. "Movies Reviews sentiment analysis and classification." IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), 2019.
[2] Mishne, Gilad, and Natalie Glance. "Predicting Movie Sales from Blogger Sentiment." AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006.
[3] Tripathy, Abinash, Ankit Agrawal, and Santanu Kumar Rath. "Classification of sentiment reviews using n-gram machine learning approach." Expert Systems with Applications, vol. 57, pp. 117-126, 2016.
[4] Narayanan, Vivek, Ishan Arora, and Arjun Bhatia. "Fast and Accurate Sentiment Classification Using an Enhanced Naive Bayes Model." Intelligent Data Engineering and Automated Learning (IDEAL), Springer, vol. 8206, 2013.
[5] Hossin, Mohammad, and M. N. Sulaiman. "A review on evaluation metrics for data classification evaluations." International Journal of Data Mining & Knowledge Management Process 5.2 (2015): 1.
[6] Denecke, Kerstin. "Using SentiWordNet for multilingual sentiment analysis." IEEE 24th International Conference on Data Engineering Workshop, Cancun, pp. 507-512, 2008.
[7] Feldman, Ronen. "Techniques and applications for sentiment analysis." Communications of the ACM 56.4 (2013): 82-89.
[8] Hamdan, Hussam, Patrice Bellot, and Frederic Bechet. "Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis." Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015.
[9] Sharma, Anuj, and Shubhamoy Dey. "A comparative study of feature selection and machine learning techniques for sentiment analysis." Proceedings of the 2012 ACM Research in Applied Computation Symposium, 2012.
[10] Prasad, Sudha Shanker, et al. "Sentiment classification: an approach for Indian language tweets using decision tree." International Conference on Mining Intelligence and Knowledge Exploration, Springer, Cham, 2015.
