Sentiment analysis on large scale Amazon product reviews

Conference Paper · June 2018


DOI: 10.1109/ICIRD.2018.8376299

2018 IEEE International Conference on Innovative Research and Development (ICIRD)

Sentiment Analysis on Large Scale Amazon Product Reviews
Tanjim Ul Haque, Nudrat Nawal Saber, Faisal Muhammad Shah
Department of Computer Science & Engineering
Ahsanullah University of Science & Technology
Dhaka, Bangladesh

978-1-5386-5283-1/18/$31.00 ©2018 IEEE

Abstract—The world is becoming increasingly digitalized. In this digitalized world, e-commerce is taking the ascendancy by putting products within reach of customers, who no longer have to leave their house. As people nowadays rely on online products, the importance of reviews is growing. To select a product, a customer needs to go through thousands of reviews to understand it. In this prospering age of machine learning, going through thousands of reviews becomes much easier if a model is used to polarize those reviews and learn from them. We used a supervised learning method on a large-scale Amazon dataset to polarize it and obtained satisfactory accuracy.

Keywords—Sentiment analysis, pool based active learning, feature extraction, text classification, machine learning.

I. INTRODUCTION

As commerce around the world has moved almost entirely onto online platforms, people trade products through different e-commerce websites. For that reason, reviewing products before buying is a common scenario, and customers nowadays lean heavily on reviews when deciding to buy a product. Analyzing the data from those customer reviews to make the data more dynamic is therefore an essential field. In this age of growing machine learning based algorithms, reading thousands of reviews to understand a product is rather time consuming, whereas we can polarize the reviews of a particular category to understand its popularity among buyers all over the world.

The objective of this paper is to categorize the positive and negative feedback of customers over different products and to build a supervised learning model that polarizes a large number of reviews. A study on Amazon last year revealed that over 88% of online shoppers trust reviews as much as personal recommendations. Any online item with a large number of positive reviews provides a powerful statement of the legitimacy of the item. Conversely, books, or any other online item, without reviews put potential buyers in a state of distrust. Quite simply, more reviews look more convincing. People value the opinion and experience of others, and the reviews of an item are the only way to understand others' impressions of the product. Opinions collected from users' experiences regarding specific products or topics directly influence future customer purchase decisions [1]. Similarly, negative reviews often cause sales loss [2]. For those reasons, understanding customer feedback and polarizing it accordingly over a large amount of data is the goal. Some similar work has been done on Amazon datasets: [5] performed opinion mining over a small set of Amazon product reviews to understand the polarized attitudes towards the products.

In our model, we used both manual and active learning approaches to label our datasets. In the active learning process, different classifiers are used to measure accuracy until a satisfactory level is reached. After getting a satisfactory result, we took those labeled datasets and processed them. From the processed dataset we extracted features that are then classified by different classifiers. We used a combination of two kinds of approaches to extract features: the bag of words approach, and the tf-idf & Chi square approach for getting higher accuracy.

II. RELATED WORKS

Much research related to product reviews, sentiment analysis, or opinion mining has been done recently. In [3], Elli and Wang extracted sentiment from the reviews and analyzed the result to build up a business model. They claimed that the demonstrated tools were robust enough to give them high accuracy. The use of business analytics made their decisions more appropriate. They also worked on detecting emotions from reviews, inferring gender from names, and detecting fake reviews. The programming languages commonly used were Python and R. They mainly used Multinomial Naïve Bayesian (MNB) and support vector machine (SVM) as their main classifiers. In paper [4] the authors applied existing supervised learning algorithms to predict a review's rating on a given numerical scale using only text. They used hold-out cross validation with 70% of the data as training data and 30% as testing data. In this paper the authors used different classifiers to determine the precision and recall values. The author of [5] applied and extended the current work in the field of natural language processing and sentiment analysis to data from Amazon review datasets. Naïve Bayesian and decision list classifiers were used to tag a given review as positive or negative. They selected reviews from the books and Kindle sections of Amazon. The authors of [6] aimed to build a system that visualizes the reviews' sentiment in the form
of charts. They scraped data from Amazon URLs and preprocessed it. In this paper they applied NB, SVM, and maximum entropy. As the paper treats summarizing the product reviews as its main point, no accuracy is reported; the results are shown in statistical charts. In paper [7] the authors built a model for predicting product ratings based on the rating text using a bag-of-words. The models tested utilized unigrams and bigrams. They used a subset of Amazon video game user reviews from UCSD. Time-based models did not work well, as the variance in average rating between each year, month, or day was relatively small. Between unigrams and bigrams, unigrams produced the most accurate result, and popular unigrams were extremely useful predictors of ratings because of their larger variance. Unigram results performed 15.89% better than bigrams. In paper [8] various feature extraction or selection techniques for sentiment analysis are applied. The authors first collected an Amazon dataset and then performed preprocessing to remove stop words and special characters. They applied phrase-level, single-word, and multiword feature selection or extraction techniques, with Naive Bayes as the classifier. They concluded that Naive Bayes gives a better result for phrase level than for single word and multiword. The main drawback of this paper is that it used only the Naive Bayes classifier, from which we cannot get a sufficient result. Paper [9] uses simpler algorithms, so it is easy to understand. The system gives high accuracy with SVM, but it cannot work properly on huge datasets. They used support vector machine (SVM), logistic regression, and decision tree methods. In paper [10], tf-idf is used as an additional experiment. It can predict ratings using bag of words, but only a few classifiers are used. They used a linear regression model and root mean square error. Building on the related works mentioned above, we tried to make our work more efficient by choosing the best ideas from them and applying those together.

In our system, we used a large amount of data, so it gave efficient results and we could make better decisions. Moreover, we used an active learning approach to label datasets, which can dramatically accelerate many machine learning tasks. Our system also includes several types of feature extraction methods. To the best of our knowledge, our proposed approach gave higher accuracy than the existing research works.

III. METHODOLOGY

Amazon is one of the largest e-commerce sites, and for that reason it holds an innumerable number of reviews. We used the data named Amazon product data, which was provided by the researchers of [14]. The dataset was unlabeled, and to use it in a supervised learning model we had to label the data. We used three JSON files in which the structure of the data is as follows:

"reviewerID": ID of the reviewer
"asin": ID of the product
"reviewerName": name of the reviewer
"helpful": helpfulness rating of the review
"reviewText": text of the review
"overall": rating of the product
"summary": summary of the review
"reviewTime": time of the review (raw)

For data we selected three categories of Amazon products: Electronics reviews, Cell Phone and Accessories reviews, and Musical Instruments product reviews, which together consist of approximately 48500 product reviews. Of these, 21600 reviews are from mobile phones, 24352 are from electronics, and 2548 are from musical instruments. From the fields listed above, we used reviewText and overall for analyzing the review polarity. We can see an overview of our methodology:

Figure 1: Work Process

A. Data Acquisition

We acquired our dataset from 3 different JSON files and labeled it. As we have a large number of reviews, manual labeling was practically impossible for us. Therefore we preprocessed our data and used an active learner to label the datasets. Amazon reviews come with a 5-star rating, and 3-star ratings are generally considered neutral reviews, meaning neither positive nor negative. So we discard any review with a 3-star rating from our dataset, take the other reviews, and proceed to the next step of labeling the dataset.

Pool Based Active Learning:
Active learning is a special case of semi-supervised learning. The main idea is that performance will be better with less training if the learning algorithm is allowed to choose the data from which it learns [2]. An active learning system tries to solve the data labeling bottleneck by querying for unlabeled instances to be properly labeled by an expert or oracle. Manually labeling the entire dataset is practically impossible.
To reduce the labeling time, we use a special kind of semi-supervised learning approach known as pool-based active learning. In our active learning process, we provide some manually labeled reviews as training and testing sets and take the remaining data as the unlabeled pool. From this pool of unlabeled data, the learning method asks the oracle or user to label a few instances and then runs some classifiers to calculate the accuracy. Accuracy shows whether the decision boundary separates most of the values into two classes; the higher the accuracy, the more reliably the data is being labeled. If the accuracy is greater than or equal to 90%, we take those data and combine them with the pre-labeled data to obtain our labeled dataset. If not, we again ask the oracle to label some more data. Once the accuracy reaches 90%, we consider the data to be labeled.

B. Data Pre-Processing

Tokenization: It is the process of separating a sequence of strings into individual units such as words, keywords, phrases, symbols, and other elements known as tokens. Tokens can be individual words, phrases, or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. The tokens work as the input for other processes like parsing and text mining.

Removing Stop Words: Stop words are those words in a sentence which carry no useful information for text mining. We generally ignore these words to enhance the accuracy of the analysis. Different stop words exist depending on the country, language, etc. English has several stop words.

POS tagging: The process of assigning one of the parts of speech to a given word is called Parts of Speech tagging, generally referred to as POS tagging. Parts of speech generally include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their sub-categories. A Parts of Speech tagger, or POS tagger, is a program that does this job.

C. Feature Extraction

Bag of Words: Bag of words is a process of extracting features by representing text or data in a simplified form, used in natural language processing and information retrieval. In this model, a text or a document is represented as the bag (multiset) of its words. Put simply, bag of words in sentiment analysis means creating a list of useful words. We used the bag of words approach to extract our feature sets. After preprocessing the dataset, we used POS tagging to separate the different parts of speech, selected nouns and adjectives from them, and used those to create a bag of words. Then we ran it through supervised learning to find our results and also the most used words in the review dataset.

TF-IDF: TF-IDF is an information retrieval technique which weighs a term's frequency (TF) and also its inverse document frequency (IDF). Each word or term has its own TF and IDF score. The product of the TF and IDF scores of a term is referred to as the TF*IDF weight of that term. Put simply, the higher the TF*IDF score (weight), the rarer the term, and vice versa. The TF of a word is the frequency of that word in a document.

The IDF of a word is the measure of how significant that term is throughout the corpus.

When words have a high TF*IDF weight in content, that content will always be amongst the top search results, so anyone can:

1. Stop worrying about using the stop-words,
2. Successfully find words with higher search volumes and lower competition.

Chi Square: Chi square (X^2) is a calculation that is used to determine how much the observed data differ from the expected data.

In this approach we preprocessed our dataset and then divided the data into training and testing sets. We used the pipeline method to apply TF-IDF, Chi square, and the classifiers to our dataset and obtained the results.

Algorithm for proposed approach

Input:
Labeled Data = labeled data obtained after active learning process.

Output:
Accuracy of classifiers;
Precision, Recall, F-1 Measure for positive and negative values.
// product review polarity accuracy

1. Load labeled data (positive & negative)
2. Preprocess labeled data
3. for every Xi in X = {X1…Xn} in labeled data
4.     extractfeature(Xi)
5. Cross validate into training & testing set
6. classifier.train()
7. accuracy = classifier.accuracy()
8. majority_voting(accuracy) using vote classifier
9. show result(accuracy, precision, recall, f1measure)
10. end

extractfeature(text) returns n-gram features
majority_voting(accuracy) returns highest accuracy

D. Evaluating Measures

Evaluation metrics play an important role in measuring classification performance. The accuracy measure is the most common for this purpose. The accuracy of a classifier on a given test dataset is the percentage of test instances that are
correctly classified by the classifier [48]. For text mining, the accuracy measure alone is not always enough to support a proper decision, so we also took some other metrics to evaluate classifier performance. Three important measures are commonly used: precision, recall, and F-measure. Before discussing the different measures, there are some terms we need to be comfortable with:

• TP (True Positive) is the number of positive instances correctly classified as positive
• FP (False Positive) is the number of negative instances incorrectly classified as positive
• FN (False Negative) is the number of positive instances incorrectly classified as negative
• TN (True Negative) is the number of negative instances correctly classified as negative

Precision: Precision measures the exactness of a classifier, that is, how many of the returned documents are correct. A higher precision means fewer false positives, while a lower precision means more false positives. Precision (P) is the ratio of the number of instances correctly classified as positive to the total number of instances classified as positive. It can be defined as

\[ P = \frac{TP}{TP + FP} \]

Recall: Recall measures the sensitivity of a classifier, that is, how many of the positive instances it returns. Higher recall means fewer false negatives. Recall (R) is the ratio of the number of instances correctly classified as positive to the total number of actual positive instances. This can be shown as

\[ R = \frac{TP}{TP + FN} \]

F-Measure: Combining precision and recall produces a single metric known as the F-measure, which is the weighted harmonic mean of precision and recall. With equal weights (the F1 score reported in the tables), it can be defined as

\[ F_1 = \frac{2 \cdot P \cdot R}{P + R} \]

Accuracy: Accuracy measures how often the classifier makes the correct prediction. It is the ratio between the number of correct predictions and the total number of predictions.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

IV. RESULTS

Several machine learning algorithms were used in our experiment: Naïve Bayesian, Support Vector Machine Classifier (SVC), Stochastic Gradient Descent (SGD), Logistic Regression (LR), Random Forest, and Decision Tree. We conducted cross validation, and 10-fold gave the best accuracy. We ran the best classifiers on 3 categories of product reviews and report the results according to the evaluation measures. The classifiers were applied to different feature selection processes, where the common features from TF-IDF and bag of words gave the best results for all the datasets.

Dataset: Cellphone & Accessories

Classifier                      Accuracy, 10-fold (%)   Accuracy, 5-fold (%)   Precision   Recall   F1 score
Linear Support Vector Machine   93.57                   88.34                  0.96        0.97     0.97
Multinomial Naïve Bayes         90.28                   84.41                  0.89        0.92     0.91
Stochastic Gradient Descent     91.88                   84.93                  0.9         0.93     0.91
Random Forest                   92.72                   88.20                  0.967       0.967    0.97
Logistic Regression             88.2                    81.99                  0.87        0.88     0.88
Decision Tree                   91.45                   83.71                  0.95        0.95     0.95

Table-1: Experiment result for cellphone & accessories data

Dataset: Musical Instruments

Classifier                      Accuracy, 10-fold (%)   Accuracy, 5-fold (%)   Precision   Recall   F1 score
Linear Support Vector Machine   94.02                   89.76                  0.9889      0.971    0.98
Multinomial Naïve Bayes         91.57                   89.77                  0.98        0.93     0.96
Stochastic Gradient Descent     92.89                   88.264                 0.99        0.96     0.98
Random Forest                   93.56                   88.51                  0.98        0.97     0.975
Logistic Regression             91.34                   87.14                  0.96        0.95     0.95
Decision Tree                   92.45                   86.27                  0.969       0.96     0.96

Table-2: Experiment result for musical instruments data

Dataset: Electronics

Classifier                      Accuracy, 10-fold (%)   Accuracy, 5-fold (%)   Precision   Recall   F1 score
Linear Support Vector Machine   93.52                   91.72                  0.98        0.99     0.98
Multinomial Naïve Bayes         89.36                   86.89                  0.899       0.96     0.93
Stochastic Gradient Descent     92.61                   90.96                  0.964       0.988    0.975
Random Forest                   92.89                   91.14                  0.968       0.988    0.978
Logistic Regression             88.96                   87.843                 0.919       0.955    0.937
Decision Tree                   91.569                  87.50                  0.962       0.9669   0.96

Table-3: Experiment result for electronics data

From all the experiments it can be seen that the support vector machine provided the highest 10-fold accuracy on every dataset, as the working dataset is quite large and the support vector machine works well with large-scale datasets without overfitting. The highest accuracy among these results was 94.02%.
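The evaluation measures of Section III-D and the k-fold protocol behind the 10-fold and 5-fold accuracy columns can be illustrated with a small sketch. It uses the standard textbook definitions of precision, recall, F1, and accuracy; the function names, the contiguous unshuffled fold layout, and the confusion counts in the example are assumptions made for illustration and are not taken from the experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for the chosen positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 computed from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds without shuffling."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical confusion counts, not results from the paper.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
folds = list(k_fold_indices(n_samples=10, k=5))
```

With k = 10 each fold holds out one tenth of the data for testing and trains on the rest, whereas k = 5 trains on less data per fold. In practice the reviews would usually be shuffled or stratified by class before splitting; the text does not say which was done.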
V. COMPARATIVE ANALYSIS

In this section we compare our research with other related works. The comparative analysis is based on accuracy. The comparison can be seen in the table below.

Paper title                                             Year (citations)  Dataset                               Accuracy
Amazon Reviews, business analytics with sentiment       2016              Review of cellphone & accessories     72.95%, 80.11%
analysis [11]
Sentiment Analysis in Amazon Reviews Using              2013 (6)          Reviews of books                      84.44%
Probabilistic Machine Learning [5]                                        Reviews of Kindle                     87.33%
Mining comparative opinions from customer reviews       2011 (234)        Customer product reviews              61.00%
for competitive intelligence [12]
Amazing: A sentiment mining & Retrieval System [13]     2009 (125)        E-commerce reviews                    87.60%
Feature Selection Methods in Sentiment Analysis and     2016              Review on books                       70.00%, 70.00%, 80.00%
Sentiment Classification of Amazon Product                                Review on music                       62.00%, 80.00%, 68.00%
Reviews [8]                                                               Review on camera                      62.00%, 80.00%, 68.00%
Proposed model                                          2018              Review of cellphone & accessories     93.57%
                                                                          Review of electronics                 93.52%
                                                                          Reviews of musical instruments        94.02%

Table-4: Comparative Analysis

The studies listed in the table used different pre-processing steps and feature extraction processes. In our research we tried to improve on all the extraction processes and preprocessing steps and to pick the best accuracy from them. The pool-based active learning process contributed to labeling and selecting the best reviews as our training and testing data. The use of different preprocessing steps helped sort out unnecessary words. Finally, by taking the best features extracted from the datasets and learning through proper classifiers, it was possible to attain greater accuracy. From the table it can be concluded that the approaches used in our proposed model are more effective and could achieve a better result than some of the related works.

VI. CONCLUSION AND FUTURE WORKS

In this research we proposed a supervised learning model to polarize a large, unlabeled product review dataset. The model is a supervised learning method that uses a mix of 2 kinds of feature extraction approaches. We described the basic theory behind the model, the approaches we used in our research, and the performance measures for the experiments conducted over quite a large dataset. We also compared our results with some similar works on product reviews, and we went through different kinds of research papers regarding sentiment analysis over text-based datasets. We were able to achieve accuracy over 90%, with F1 measure, precision, and recall also over 90%. We tried different simulations using cross validation, training-testing ratios, and different feature extraction processes, comparing varying amounts of data to achieve promising results. In most cases 10-fold cross validation provided better accuracy, while the Support Vector Machine (SVM) provided the best classification results. It is hard to gather a huge gold-standard dataset for this purpose, as e-commerce sites limit the data they make public. Scraping data can also be a problem, as we can't scrape enough data to consider it representative of real-life public reviews over different products.

Some future work can improve the model and make it more effective in practical cases. Our future work includes applying PCA (Principal Component Analysis) in the active learning process to further automate data labeling with less assistance from the oracle. The model can be incorporated into programs that interact with customers seeking a score for a particular product. As we used a large-scale dataset, we can apply the model to local market sites to get better accuracy and usability. Lastly, we will try to continue this research until we generalize this model to all kinds of text-based reviews and comments.

REFERENCES

[1] Samha, Xu, Xia, Wong & Li. "Opinion Annotation in Online Chinese Product Reviews." In Proceedings of LREC conference, 2008.
[2] Nina Isabel Holleschovsky. "The social influence factor: Impact of online product review characteristics on consumer purchasing decisions." 5th IBA Bachelor Thesis Conference, Enschede, The Netherlands, 2015.
[3] Elli, Maria Soledad, and Yi-Fan Wang. "Amazon Reviews, business analytics with sentiment analysis." 2016.
[4] Xu, Yun, Xinhui Wu, and Qinxia Wang. "Sentiment Analysis of Yelp's Ratings Based on Text Reviews." (2015).
[5] Rain, Callen. "Sentiment Analysis in Amazon Reviews Using Probabilistic Machine Learning." Swarthmore College (2013).
[6] Bhatt, Aashutosh, et al. "Amazon Review Classification and Sentiment Analysis." International Journal of Computer Science and Information Technologies 6.6 (2015): 5107-5110.
[7] Chen, Weikang, Chihhung Lin, and Yi-Shu Tai. "Text-Based Rating Predictions on Amazon Health & Personal Care Product Review." (2015).
[8] Shaikh, Tahura, and Deepa Deshpande. "Feature Selection Methods in Sentiment Analysis and Sentiment Classification of Amazon Product Reviews." (2016).
[9] Nasr, Mona Mohamed, Essam Mohamed Shaaban, and Ahmed Mostafa Hafez. "Building Sentiment analysis Model using Graphlab." IJSER, 2017.
[10] Wang, Mingshan. "Text mining for yelp dataset challenge." University of California San Diego (2017).
[11] Elli, Maria Soledad, and Yi-Fan Wang. "Amazon Reviews, business analytics with sentiment analysis." 2016.
[12] Xu, Kaiquan, et al. "Mining comparative opinions from customer reviews for Competitive Intelligence." Decision Support Systems 50.4 (2011): 743-754.
[13] Miao, Q., Li, Q., & Dai, R. (2009). AMAZING: A sentiment mining and retrieval system. Expert Systems with Applications, 36(3), 7192-7198.
[14] He, Ruining, and Julian McAuley. "Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering." Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.
