College of Engineering
A Thesis in
Industrial Engineering
by
Rutvik Sharedalal
Master of Science
May 2019
The thesis of Rutvik Sharedalal was reviewed and approved by the following:
Saurabh Basu
Assistant Professor of Industrial and Manufacturing Engineering
Janis Terpenny
Peter and Angela Dal Pezzo Professor
Head of the Department of Industrial and Manufacturing Engineering
ABSTRACT
Social media has given ample opportunity to the consumer in terms of gauging the quality of the
products by reading and examining the reviews posted by the users of online shopping platforms.
Moreover, online platforms such as Amazon.com provide an option for users to label a review
as ‘Helpful’ if they find the content of the review valuable. This helps both consumers and vendors, as purchasing decisions can be guided by the
selected helpful reviews. However, recently posted reviews get comparatively fewer votes and
the higher voted reviews get into the users’ radars first. This study deals with these issues by
building an automated text classification system to predict the helpfulness of online reviews
irrespective of the time they are posted. The study is conducted on the data collected from
Amazon.com consisting of the reviews on fine food. The focus of previous research has mostly
remained on finding a correlation between the review helpfulness measure and review content-
based features. In addition to finding significant content-based features, this study uses three
different approaches to predict the review helpfulness, which include vectorized features, review
and summary centric features, and word embedding-based features. Moreover, the conventional
classifiers used for text classification such as Support vector machine, Logistic regression, and
Multinomial naïve Bayes are compared with a decision tree-based ensemble classifier, namely
Extremely randomized trees. It is found that the Extremely randomized trees classifier outperforms
the conventional classifiers except in the case of vectorized features with unigrams and bigrams.
Among the features, vectorized features perform much better compared to other features. This
study also found that the content-based features such as review polarity, review subjectivity,
review character and word count, review average word length, and summary character count are significant in predicting the review helpfulness.
TABLE OF CONTENTS
Chapter 1 Introduction
References

LIST OF FIGURES
Figure 4-4: Distribution of binary response variable in terms of ‘Review Score’ attribute

LIST OF TABLES
Table 4-1: Descriptive statistics of ‘Helpfulness Numerator’ and ‘Review Score’ attributes
Table 5-1: Classifiers and their hyperparameters for vectorized features
Table 5-4: Performance evaluation of TF-IDF vectorizer matrices with LSA
Table 5-5: Classifiers and their hyperparameters for review and summary centric features
Table 5-8: Classifiers and their hyperparameters for word embedding based features
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my advisor Dr. Soundar Kumara, Professor of Industrial
Engineering, for his valuable support and exemplary guidance throughout the course of my thesis.
I am also grateful to him as his teaching has helped me acquire great knowledge and skills.
I would also like to express my gratitude to Dr. Saurabh Basu, Assistant Professor of Industrial
and Manufacturing Engineering, for taking his time to review my thesis and giving constructive
feedback.
I would also like to thank Dr. Janis Terpenny, Department Head of Industrial and Manufacturing Engineering, for reviewing my thesis and for her support.
I take this opportunity to thank my colleague Visanu Chumongkhon for sharing his knowledge and expertise.
Finally, I express my profound gratitude to my family and friends for their continuous support and
encouragement throughout my graduate studies. This would not have been possible without all of
you.
Chapter 1
Introduction
Business firms are now using online platforms and social media to market their products and
services. The increasing use of social media and the internet has significantly changed the way
consumers shop for products. Social media also helps customers receive better accessibility to
product information. Studies infer that customers show more trust towards the online consumer
reviews than the information provided by the vendors (Salehan et al., 2016). Moreover, online
consumer reviews are more user-oriented and describe the products from the user’s perspective.
Consumers can judge the quality of the products by reading and examining the numerous online
reviews available on public platforms such as Amazon.com. However, if these reviews are
not organized or presented properly, they can leave the customers in an ambiguous situation. The
overwhelming quantity of the reviews can affect the unbiased decision making of the consumers
mainly because of the uneven quality or untruthfulness (spam) of the reviews (Salehan et al., 2016;
Liu et al., 2008). To overcome this problem, Amazon.com provides ‘helpful’ field along with the
review text which lets the user vote whether the information provided in the review was valuable
or not. Consumers can use this as a reference for making a rational purchasing decision.
1.1 Motivation
It has been estimated that the ‘Helpfulness Voting’ system brings Amazon.com about $2.7
billion in additional revenue (Cao et al., 2011). It is imperative for any business organization to
perceive which factors determine the helpfulness of online reviews. This can help the online
business managers increase the revenue by designing and implementing an automated system for identifying helpful reviews.
Although this voting process aids in building an automated classification system,
there are some detrimental issues that need to be taken care of before proceeding further. First,
reviews which are posted most recently might have very few votes or no votes at all and therefore,
identifying the helpfulness of these reviews becomes very difficult. Moreover, the highest voted
reviews get viewed first before the newly published reviews which are yet to be voted (Liu et al.,
2008). For example, Kim et al. (2006) found in their study that, out of 20,919 Amazon reviews for
all MP3 player products, 38% received 3 or fewer helpfulness votes. It is also known that there is
always a considerable bias in the human analysis of textual information as users like to believe
what matches with their dispositions (Liu et al., 2012). Studies have also shown that the ratings
that the reviews get do not convey profound information as they are either extremely high or extremely low.
In light of all the above factors, it is important to build an automated text-based classification
system that predicts the helpfulness of online consumer reviews irrespective of when the reviews
get posted. If the newly posted reviews are instantaneously deemed ‘Helpful’ or ‘Not helpful’ by
the system built on machine learning classification algorithms, it gives consumers a reasonable
choice to make their purchasing decisions rather than relying solely on older reviews. Text mining
and Natural Language Processing (NLP) have been used in the past to build a text classification
system to identify spam reviews (Crawford et al., 2015; Shojaee et al., 2013; Lilleberg et al., 2015).
The same can be applied to detect whether the reviews are helpful or not.
The goal of this thesis is to develop an automated text-based classification system that can
accurately predict the helpfulness of Amazon online consumer reviews. The problem considered
is to perform a binary classification using the combination of text-based features and machine
learning classification algorithms. The binary classes will be defined as - ‘1’ being ‘Helpful’ and
‘0’ being ‘Not helpful’. The helpfulness measure will be determined based on the number of users
who voted the review as ‘helpful’. Multiple types of text-based features will be used for this study
which includes matrix based vectorized features, word embedding based features, and features
extracted from review and summary text such as structural, syntactic, semantic features and meta-
data. There will also be an attempt to combine some of the above features with reference to the
previous work. The prediction model will be obtained based on the available training data which
consists of the product review text, review summary, review rating, details on the helpfulness votes
and other product and user related information. The algorithms that will be used for the classification
purpose are Support Vector Machine, Multinomial naïve Bayes, Logistic regression, and
Extremely randomized trees. This thesis also discusses the required text pre-processing methods
as they help avoid overfitting and reduce computational complexity by removing noise from the text.
The rest of the thesis is structured as follows. Chapter 2 discusses the previous work and provides
a detailed explanation of the features and algorithms. Chapter 3 provides an overview of the data. Chapter 4 describes the methodology. Chapter 5 presents
the analysis and summarizes the results. Conclusions of the study and the scope of potential future work are discussed in the final chapter.
Chapter 2
Literature survey and Background
The focus of previous research on review helpfulness has mostly remained on automating
the prediction of review helpfulness to tackle the issues regarding the enormous quantity and
uneven quality of online consumer reviews. A considerable amount of work has been done in
finding correlations between review helpfulness and text-based features with the use of text mining techniques.
The study conducted by Salehan et al. (2016) investigated the effect of sentiment on the
helpfulness of online consumer reviews. The study provided insights into the performance of
online reviews by investigating the effect of sentiment polarity on the helpfulness. Their research
model included both the title related and review related measures. Apart from polarity and
sentiment of the reviews, the other main factors included in the model were the length and
longevity of the reviews. As websites like Amazon.com sort reviews by ‘most helpful’ rather than
‘most recent’, it was justified to include longevity as one of the factors in the study. It was found
that the reviews with neutral polarity affected helpfulness significantly compared to positive and
negative ones. Also, the length of a review was found to be another significant factor, as longer reviews tend to receive more helpfulness votes.
In another study, Cao et al. (2011) emphasized the question regarding why some reviews get more
helpfulness votes compared to other reviews. For this, they studied the impact of basic, stylistic
5
and semantic characteristics of online reviews on the number of helpfulness votes they receive
using the data collected from CNET Download.com. The study was conducted using Ordinal
Logistic Regression model with the accuracy being determined by 3 criteria namely –
Misclassification rate, AIC (Akaike’s Information Criterion) and Lift ratio. It was concluded that
the semantic characteristics were more influential compared to the other characteristics in determining the number of helpfulness votes a review receives.
Liu et al. (2008) built a non-linear regression model using the features extracted from the IMDB
movie reviews dataset. The factors considered by them were – reviewer expertise, writing style,
timeliness of the reviews, length of the review, polarity of the review, and the average rating of all
the reviews. The helpfulness measure was approximated as the ratio of the number of users who
found the review helpful to the total number of users who voted whether the review was helpful
or not. Out of all the factors considered, the factors that were deemed the most important were –
reviewer expertise, writing style, and timeliness. These factors were quantified and used in the
model to check their individual and combined effectiveness with the help of the MSE score. The best performance was obtained when the factors were used in combination.
Kim et al. (2006) focused on performing an instantaneous assessment of the review helpfulness to
provide rapid feedback to the review authors. In their study, the effect of various text-based
features on the review helpfulness was studied using SVM regression modeling. The feature list
included – structural features, lexical features, syntactic features, semantic features, and meta-data.
It was found that the most useful features were the length of the review, the unigrams of the review text, and the product rating.
To deal with the problem of assessing a large number of reviews on online market platforms,
Ghose et al. (2007) built ranking mechanisms to quickly locate the underlying useful reviews. This
was done using two approaches: user-oriented and manufacturer-oriented. The
review usefulness of the user-oriented approach was determined by the helpfulness votes a review
obtained, whereas the manufacturer-oriented approach used sales of the product to gauge the
usefulness. Econometric and subjectivity analysis was performed using regression modeling on
the panel data collected from Amazon.com. It was found in the study that the reviews which
contained a mixture of subjective and objective elements got more helpfulness votes.
This section sheds some light on the work done related to text mining and NLP in domains other
than review helpfulness. The various text-based features and machine learning
algorithms used in the domain of spam review detection have been discussed here. This section
also emphasizes the use of Support Vector Machine in the text categorization problems.
Crawford et al. (2015) conducted a survey of spam detection in which they discussed features
extracted from text using Natural Language Processing and compared different approaches for
classifying the reviews as ‘spam’ or ‘not spam’. The features used in the study included Bag of
words, Term Frequency, POS tagging, and other lexical and semantic features. The classification
task was performed using algorithms such as SVM (Support Vector Machine), Naïve Bayes, and
Logistic Regression. Among the algorithms used, SVM was found to perform better than Naïve
Bayes and Logistic Regression overall, while occasionally being beaten by them. Moreover, the performance improved when multiple types of features were combined.
In their study of detecting deceptive opinions, Shojaee et al. (2013) came up with stylometric
features to distinguish the spammer writing style. It was stated in their study that the spammers try
to change their writing style and verbal usage by using simpler words with shorter and fewer average syllables
per word. The features were categorized into 2 types, i.e. lexical and syntactic features. Examples
of these features include the total number of tokens, average sentence length, average
token length, and occurrences of uppercase letters. The machine learning classification algorithms used
were Support Vector Machine and Naïve Bayes. Features were analyzed separately as well as
combined and compared using the F-score measure. The highest accuracy was achieved when the features
were combined, irrespective of the algorithm applied. Moreover, SVM performed better overall than Naïve Bayes.
Previous research has also shown that combining the semantic features and word frequency-based
features can outperform either of them used alone. In their study of text classification, Lilleberg et
al. (2015) combined the word vectors obtained from the Word2vec algorithm with the TF-IDF
measure of the words. The assumption was that the TF-IDF matrix does not represent the semantic
relationships between the tokens which Word2vec provides. Besides this, various combinations
were tried in terms of the inclusion of stop words. The ’20 newsgroup’ text data was used for the
study and Linear SVC was used as the classification algorithm. Combining Word2vec with TF-
IDF without the stop words rendered the highest accuracy among all the different combinations
except in some rare cases where it could not outperform the TF-IDF without the stop words.
As far as machine learning classification algorithms are concerned, Naïve Bayes and Support
Vector Machine perform better for the text classification problems (Vinodhini et al., 2012).
Joachims (1998) has given theoretical evidence on why SVM performs well for text categorization:
1) SVM has the potential to handle the large feature spaces related to the text categorization
problems as its functionality does not depend upon the dimensionality of the problem.
2) SVM is well suited for problems with sparse instances and dense concepts (high-dimensional input spaces in which each document vector contains only a few non-zero entries).
3) Most text classification problems tend to be linearly separable, and SVM is designed to find such linear separators.
The literature denotes that various text-based features have been used in the studies of text mining.
As far as the research in review helpfulness is concerned, the focus of the studies has mostly
remained on exploring review and reviewer centric characteristics such as review length, review
polarity, review subjectivity, reviewer’s writing style, and so on. The idea behind this was to come
up with the significant features which are highly correlated to the review helpfulness measure in
order to rank reviews in terms of their perceived helpfulness. Even though the word embedding
based features have been used in other domains of text classification, they have remained largely unexplored in the domain of review helpfulness. Also, the performances of
vectorized features and word embedding based features are yet to be compared with the review
content-based features in predicting the review helpfulness. Moreover, this study also compares
the performance of a decision tree-based ensemble classifier - Extremely randomized trees - with
the conventional algorithms used for text classification such as Support vector machine,
Multinomial naïve Bayes, and Logistic regression. The further sections discuss the functionality of the features and algorithms used in this study.
This set of features uses the vectorization process to transform a collection of text documents into a matrix of numerical features.
The first variant, the count vectorizer, converts the collection of text documents into a matrix of token counts.
Table 2-1: Snapshot of the count vectorizer matrix

      T1    T2    T3    T4
D1    F11   F12   F13   F14
D2    F21   F22   F23   F24
Table 2-1 shows a snapshot of the count vectorizer matrix, where D1 and D2 denote unique text
documents and T1, T2, T3, T4 denote the unique terms that the documents contain. For example,
if there are M unique documents and N unique words present in a corpus of text documents, the
matrix obtained will be of the size M x N. The value in the cell denotes the number of times a
specific word has appeared in a particular document. For instance, F11 denotes the frequency with
which Term-1 appears in the Document-1. The columns of this matrix work as unique features
when the matrix is used as an input to any machine learning classification algorithm. The limitation
of this feature set is that it does not consider the semantic relation between the terms as it counts
the frequency of each word separately. In this case, a slight modification called ‘n-gram’ can be
used which considers multiple terms as a single entity and then calculates the frequency of its
appearance in the document. For instance, if a bi-gram is used, it calculates the frequency with which a pair of consecutive words appears in the document.
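As an illustration, the following is a minimal sketch of building such a matrix with the ‘scikit-learn’ library used elsewhere in this study; the toy documents are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical review texts standing in for the corpus
docs = ["great quality and lowest price",
        "terrible quality, waste of money"]

# ngram_range=(1, 2) counts single words (unigrams) and word pairs (bigrams)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)         # sparse M x N matrix of counts

print(vectorizer.get_feature_names_out())  # the unique uni-/bi-gram features
print(X.toarray())                         # per-document frequency values
```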
The TF-IDF vectorizer works in the same way, with the only difference being that it calculates the TF-IDF value
of a term instead of the frequency with which it appears in a particular document. TF-IDF is
basically a multiplication of two terms – Term Frequency and Inverse Document Frequency. The
mathematical expressions of these terms are given below (Pedregosa et al., 2011):
$$\mathrm{TF}(t,d) = f_{t,d}$$

$$\mathrm{IDF}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

where N is the total number of documents in the corpus D.
Hence, TF is denoted by the number of times term t appears in document d. Whereas, IDF is the
logarithmic ratio of the total number of documents in the corpus and the number of documents in
which the term t appears (Pedregosa et al., 2011). The justification for multiplying the term
frequency value with the inverse document frequency is to give less weight to the commonly
occurring words which do not convey much information. IDF provides a low value for frequently
occurring terms as the number of documents in which the term occurs is in the denominator. Like
the count vectorizer matrix, modifications such as ‘n-grams’ can also be applied to this
feature set. While calculating the IDF value, 1 is added to the denominator as a smoothing measure to avoid division by zero for unseen terms.
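A minimal sketch with scikit-learn’s TfidfVectorizer follows; note that scikit-learn’s exact IDF formula differs slightly from the one above, and smooth_idf applies the smoothing just described (the documents are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this coffee is great",
        "big disappointment",
        "great beans"]

# smooth_idf=True adds 1 to the document frequencies to avoid zero divisions;
# ngram_range=(1, 2) would add bigram features as with the count vectorizer
vectorizer = TfidfVectorizer(smooth_idf=True)
X = vectorizer.fit_transform(docs)   # sparse matrix of TF-IDF weights
print(X.shape)
```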
Latent Semantic Analysis is a technique which explores the hidden concepts in text data by
converting each document and word in a vector form with vector elements representing the
concepts (Thomo et al., 2009). The idea behind the technique is to find the semantic relationships
between the document-document, term-term, and term-document with the assumption that the
words and documents with a similar context or similar meaning will have a smaller distance between their corresponding vectors.
The vector representation of the terms and documents is obtained by applying Singular Value
Decomposition to a bag of words matrix such as Count Vectorizer or TF-IDF Vectorizer discussed
above. Let A be the count vectorizer matrix of the dimension MxN, where M is the total number
of unique documents and N is the total number of unique terms in the text corpus. This matrix can
be decomposed using SVD (Singular Value Decomposition) into 3 separate matrices as shown
below:
$$A = D \cdot S \cdot W^{T}$$

Here, D holds the document vectors, W holds the term vectors, and S is a diagonal matrix consisting of singular values along the diagonal. If the smaller singular
values are ignored and replaced by zero, a new matrix with reduced dimensionality is obtained.
If this reduced matrix is multiplied with the original document and term matrices D and W, a new matrix is
obtained with the same dimensions as those of matrix A but with different elements. This new
matrix has been shown to perform better than the original matrix in terms of uncovering the
underlying semantic relationships (Landauer et al., 1998). The rows of this matrix are used as
word vectors and columns are used as document vectors in a text mining problem.
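A minimal sketch of LSA via truncated SVD with scikit-learn (the corpus and the number of retained concepts are illustrative):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great coffee, great beans",
        "terrible coffee, big disappointment",
        "tasty banana bread, good quality"]

X = TfidfVectorizer().fit_transform(docs)  # bag-of-words (TF-IDF) matrix A

# Keep only the k largest singular values; each row of the result is a
# document vector in the reduced concept space
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)
print(X_lsa.shape)                         # (3 documents, 2 concepts)
```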
2.4 Word embedding based features
2.4.1 Word2vec
This concept was first created by Mikolov et al. (2013) at Google. This method uses 2-layered
shallow neural networks to represent words in a vector space consisting of several hundred
dimensions. This algorithm has 2 variants namely, CBOW (Continuous Bag of Words) and Skip-
gram.
Figure 2-1: CBOW and Skip-gram model architectures (from Mikolov et al., 2013)
As it can be seen from Figure 2-1, the CBOW model predicts a word based on the surrounding
context words, whereas the Skip-gram model uses the current word to predict the context words.
Before the words are inserted into the model, they are vectorized using the one-hot encoder. For
example, the first word in the corpus of vocabulary V will have a vector of elements {X1, X2,……,
Xv} where X1 is 1 and all other elements are zero. Word vectors are trained using the back-propagation
method and the Softmax function is used at the output layer to calculate the probability
of the outcome (Rong X., 2014). Once the model is trained, the weights between the projection
layer and the output layer represent the elements of word vectors for the words which are supposed
to be predicted. As far as performance is concerned, Skip-gram is deemed better when the analysis
involves capturing the semantic relationship between the words (Mikolov et al., 2013). In a text
classification problem, a document or a text needs to be represented in the form of the words it
includes. One possible approach is to calculate the element-wise arithmetic mean of the word
vectors which yields a single vector for the document. Each of the elements of this document
2.4.2 Word2vec combined with TF-IDF
This approach was used by Lilleberg et al. (2015) in their study of text classification. A simple
Word2vec approach only provides the vector or the position of a word in an n-dimensional vector
space but it does not tell how important it is or how frequently it appears in a particular document.
The idea behind using this approach is to consider word importance along with its vector
representation. In a text classification problem, the word vectors of a particular document can be
multiplied with the corresponding TF-IDF values and then the average can be taken to come up with a vector for the whole document:

$$V_D = \frac{1}{|D|} \sum_{w \in D} V_w \cdot TI_{w,D}$$

where $V_D$ is the document vector, $V_w$ is the word vector, $TI_{w,D}$ is the TF-IDF value of word $w$ for document $D$, and $|D|$ is the number of words in the document.
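A small sketch of this weighted averaging; the dictionaries word_vectors and tfidf_weights are hypothetical stand-ins for a trained Word2vec model and per-document TF-IDF values:

```python
import numpy as np

def document_vector(tokens, word_vectors, tfidf_weights):
    """TF-IDF weighted average of word vectors (V_D above).

    tokens        -- list of tokens in one review
    word_vectors  -- dict: token -> word vector (np.ndarray)
    tfidf_weights -- dict: token -> TF-IDF value for this review
    """
    vecs = [word_vectors[w] * tfidf_weights.get(w, 0.0)
            for w in tokens if w in word_vectors]
    if not vecs:                          # no known words: return a zero vector
        dim = len(next(iter(word_vectors.values())))
        return np.zeros(dim)
    return np.mean(vecs, axis=0)          # element-wise average over the words
```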
2.5 Review centric features
The review related features that have been used in this study include structural, syntactic and
semantic features of the review text. The structural features of the text consist of the word count
of the review text, character count of the review text, and average length of the words used in the
review text, whereas the syntactic features include number of punctuation marks used in the review
text, the number of proper case words used in the review text, and the number of uppercase words used in the review text.
The semantic features of the text include the polarity and subjectivity of the review text. Polarity
denotes whether the text is positive, negative or neutral in sentiment, whereas subjectivity denotes how subjective or objective the text is.
2.6 Summary centric features
Along with the review text, the text included in the summaries of the reviews has also been
analyzed in this study. The analysis of the summary text involves structural and semantic features.
However, the syntactic features are not considered for the summary due to the lesser content
compared to that of the review text. Semantic features consist of the polarity and subjectivity of
the text.
2.7 Meta-data
Meta-data features are the features which are not directly related to the linguistic features of the
text. The meta-data feature that has been included in this study is the rating of the product given by the reviewer in the review.
2.8 Multinomial naïve Bayes
This model comes from the family of probabilistic classifiers which are based on applying the
Bayes’ theorem with the naïve assumption of independence between the features. In this model,
the document is represented as a bag of words and then the document in each class is modeled as
samples drawn from a multinomial word distribution (Aggarwal et al., 2012). The conditional
probability of a document given a class is then calculated by multiplying the probability of each
observed word in the corresponding class. Finally, the Bayes’ rule is applied to calculate the
posterior probability of a class given a document, and the class with the highest posterior
probability is assigned to the document. Assuming there are m unique classes and a document d is represented as a bag of words, the decision rule can be formulated as shown below:

$$P(C_i \mid d) \propto P(C_i) \prod_{w \in d} P(w \mid C_i), \qquad i = 1, \dots, m$$
2.9 Logistic regression
Logistic regression is a regression-based linear classifier which is used to model discrete response
variables (Aggarwal et al., 2012). The coefficients of the regression model are obtained by
optimizing a conditional likelihood function. As in the case of Multinomial Naïve Bayes model,
the class is assigned to the document which has the higher posterior probability. In this model, the posterior probability of class $C_i$ given the feature vector $X_i$ is expressed as:

$$P(C_i \mid X_i) = \frac{e^{(A \cdot X_i + b)}}{1 + e^{(A \cdot X_i + b)}}$$
The above posterior probability can be transformed using a logistic function and represented by
the linear combination of the features. This helps in conceiving the problem in terms of simple
linear regression.
2.10 Support vector machine
The primary principle behind the Support Vector Machine is to sample and find a separator in a
feature space which can optimally separate the different classes (Aggarwal et al., 2012). The
problem of finding the best hyperplane is an optimization problem, and SVM uses quadratic programming to solve it.
Figure 2-2: Maximum margin of separation for SVM (from Aggarwal et al., 2012)
As it can be seen in Figure 2-2, hyperplane A provides the maximum margin of separation, as
the normal distance of the data points, irrespective of their class, is the largest from this
hyperplane. Moreover, the normal vector to this hyperplane is a direction in which the maximum
discrimination is obtained in the feature space. Since SVM utilizes an appropriate combination of
the features to come up with the optimal direction of separation in the feature space, it is considered
highly robust to the high dimensionality of the feature set (Aggarwal et al., 2012).
2.11 Extremely randomized trees
This algorithm was first proposed by Geurts et al. (2006). It belongs to the family of decision trees
and is a variant of the random forest algorithm. Like the random forest, it is an ensemble of trees
but it differs in a way that it uses the whole learning sample for a single tree rather than a bootstrap
sample. Moreover, it splits the nodes by choosing cut-off points randomly, whereas random forest
splits the nodes optimally by using the Gini index or Information gain criterion (Geurts et al.,
2006). The rationale behind this algorithm is to reduce the variance of the decision trees by
introducing randomization in both the features and cut-off point selection during node splitting.
For each node, the cut-off points are randomly generated for each of the sampled features and the
point with the best score is considered as a cut-off point for that particular node. The score is
calculated by the normalization of information gain. Geurts et al. (2006) have also pointed out in
their paper that extremely randomized trees are computationally more efficient than the random forest.
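To make the contrast concrete, the following is a minimal scikit-learn sketch (the hyperparameter values mirror those used later in this thesis):

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Extremely randomized trees: each tree sees the whole learning sample
# (bootstrap=False by default) and the cut-points are drawn at random
et = ExtraTreesClassifier(n_estimators=10, max_features="sqrt", random_state=0)

# Random forest: bootstrap samples and optimized (Gini-based) split points
rf = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
```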
Chapter 3
Overview of the data
The dataset is provided by Kaggle website and was collected and published first by McAuley et
al. (2013) in their research related to online reviews. The dataset consists of reviews on fine food
posted on Amazon.com. It has a total of 568,454 reviews on 74,258 products. The attributes and their descriptions are listed in Table 3-1.
The review text contains a detailed description of the product from the user’s perspective.
Moreover, it also describes the overall sentiment of the user toward the product apart from the
description of the product itself. Some of the examples of the review text are shown below:
1) “This is great stuff. Made some really tasty banana bread. Good quality and lowest price
in town.”
2) “This coffee is great because it’s all organic ingredients! No pesticides to worry about plus …”
3) “These condiments are overpriced and terrible. The classic is disgustingly sweet. The spiced …”
On the other hand, the summary text represents a user’s sentiment in a very confined manner. The
information is conveyed in few words in this case. Some of the examples of the summary text are
shown below:
1) “Waste of money”
2) “Great beans!!!”
3) “Big disappointment”
In this study, one of the approaches uses the features extracted from the review and summary text
in order to come up with the features that contribute significantly in predicting the review
helpfulness.
As can be seen in Table 3-1, the helpfulness related measures included in the data are ‘Helpfulness
numerator’ and ‘Helpfulness denominator’. ‘Helpfulness numerator’ denotes the total number of
users who found the review helpful and ‘Helpfulness denominator’ denotes the total number of
people who voted whether the review was helpful or not. Earlier Amazon.com used to provide
both the ‘Yes’ and ‘No’ options to its users in order for them to vote for the helpfulness. However,
it now provides only the ‘Yes’ option to its users. Hence, if a user finds a review helpful, they vote
for it; otherwise they do nothing. In order to make this study relevant, the process of
response variable generation will solely be based on the ‘Helpfulness numerator’ measure which
denotes the total number of users who voted in the favor of helpfulness. However, ‘Helpfulness
denominator’ measure will be used as a reference to discard the reviews for which the users did
not vote at all. A review which has a ‘Helpfulness numerator’ value above a specific threshold
will be considered ‘Helpful’; otherwise it will be considered ‘Not helpful’. The process of generating the binary response variable is explained in detail in Chapter 4.
In the approach consisting of the review and summary centric features, there is an attempt to find
a correlation between the semantics of the text and the review helpfulness measure through the
polarity and subjectivity of the review and summary text. The polarity of the text denotes how
positive or negative the sentiment of the text is, whereas subjectivity denotes how subjective or
objective the review statement is. In this study, the polarity is quantified in the range [-1,1] where
‘-1’ denotes extremely negative sentiment and ‘1’ denotes extremely positive sentiment. On the
other hand, subjectivity is quantified in the range [0,1] where ‘0’ denotes ‘highly objective’ and ‘1’ denotes ‘highly subjective’.
Chapter 4
Methodology
This section discusses in detail the process of building an automated text classification system for
predicting the helpfulness of online fine food reviews posted on Amazon.com. It encapsulates the
following steps:
1) Performing exploratory data analysis for generating the binary response variable
2) Explaining various text preprocessing steps which are used to remove noisy terms
3) Explaining the model training procedure and the approaches used for it
4) Validating the models using appropriate performance measures
Figure 4-1: Steps of the methodology: response variable generation, text preprocessing, training models for classification, and model validation.
This section discusses the data manipulation performed before training the model and also explains how the binary response variable is generated.
As it can be seen in Table 3-1, the attribute ‘Helpfulness Denominator’ denotes the number of
users who voted whether they found the review helpful or not. Since the goal of this study is to
build a model which predicts whether the review is helpful or not, the reviews which did not get
any vote - either helpful or not helpful, will not be considered in the study. This reduces the number
of observations from the original 568,454 to 298,402. Now, the next step is to examine the
distribution of the attributes ‘Helpfulness Numerator’ and ‘Score’ by looking at their descriptive
statistics.
Table 4-1: Descriptive statistics of ‘Helpfulness Numerator’ and ‘Review Score’ attributes
The attribute ‘Helpfulness Numerator’ denotes the number of users who indicated that the review
was helpful. In this study, this attribute will be considered as a reference for generating the binary
response variable. As Table 4-1 shows, the median value of this attribute is ‘1’. Let us now evaluate
the frequency distribution of the attribute with its median working as a threshold.
As it can be inferred from Figure 4-2, dividing the ‘Helpfulness Numerator’ attribute into 2 parts
with threshold ‘1’ gives us sets of observations of almost equal size. The specific sample sizes
are 150,241 and 148,161 for values less than or equal to 1 and greater than 1, respectively. Hence,
this particular threshold will be used as a reference for generating the binary response variable as
it helps avoid the intricacies associated with the class imbalance problem. From now on, the class
with the values ‘less than or equal to 1’ will be referred to as class ‘0’ (Not helpful) and the one with the values ‘greater than 1’ as class ‘1’ (Helpful).
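A minimal sketch of this filtering and labeling with pandas; the column names follow the Kaggle fine food reviews file and may differ in other versions of the data:

```python
import pandas as pd

df = pd.read_csv("Reviews.csv")   # Kaggle Amazon fine food reviews

# Discard reviews that received no helpfulness vote at all
df = df[df["HelpfulnessDenominator"] > 0]

# Median-based threshold: numerator > 1 -> '1' (Helpful), else '0' (Not helpful)
df["Helpful"] = (df["HelpfulnessNumerator"] > 1).astype(int)

print(df["Helpful"].value_counts())   # roughly balanced classes
```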
The next step is to examine the distribution of the binary class attribute in terms of the ‘Score’
attribute. First of all, let us examine the distribution of the ‘Score’ attribute and then we will show
how the binary classes are distributed for each value of the score attributes. As described in Table
3-1, the ‘Score’ attribute denotes the rating given to a particular product by the reviewer in the review,
and it consists of integer values from 1 to 5 with ‘1’ being ‘extremely low’ and ‘5’ being
‘extremely high’.
Figure 4-4: Distribution of binary response variable in terms of ‘Review Score’ attribute
As it can be seen from Figure 4-3, the distribution of the ‘Score’ attribute is highly skewed
towards the score ‘5’ which means the majority of the reviews have ‘extremely high’ rating.
Moreover, Figure 4-4 shows that the distribution of binary class is the same for each value of the
score attribute. Hence, it can be inferred that the ‘Score’ attribute does not contribute much in discriminating between the two classes.
Text preprocessing is highly essential in working with text classification problems. It helps in
improving the computational efficiency and avoiding the overfitting problem by eliminating the
noisy features. We discuss below the various text preprocessing methods that are applied before training the classification models.
4.2.1 Tokenization
Tokenization is basically the process of splitting the sentences into words. Each word is then
considered as a separate token. This is the first step in text preprocessing, on which all the subsequent steps are based.
4.2.2 Lowercase conversion
Since the semantics of a word or a phrase do not depend upon which case the word is written in,
the uppercase words are converted to lowercase to avoid potential duplication of the words. For
example, the words ‘DOG’ and ‘dog’ convey the same meaning. This conversion helps in reducing
4.2.3 Punctuation removal
In this step, the punctuation marks are removed from the text as they do not provide any extra
information while extracting the semantics from the text (Hotho et al., 2005).
4.2.4 Stop words removal
In natural language processing, stop words are the most commonly used words which do not
convey much meaning (Hotho et al., 2005). Some of the examples of the stop words include short
function words such as ‘a’, ‘an’, ‘the’, ‘is’, ‘are’, ‘which’, ‘at’, and ‘on’. The ‘NLTK’ library of Python is used in this study to remove the stop words.
4.2.5 Lemmatization
Lemmatization is a text normalization process of reducing the inflectional form of words into base
or dictionary form which is called ‘lemma’. For example, lemmatization converts the word ‘ran’
into its base form ‘run’. In lemmatization, complete morphological analysis of words is done to
ensure that the base word belongs to the dictionary (Manning et al., 2008). In this way,
lemmatization has a slight edge over its counterpart ‘stemming’ method which usually removes
the prefix or suffix associated with the word. The root word in stemming is not required to be a
valid word from the language. Keeping this in mind, lemmatization will be used in this study to
reduce the inflectional form of words. The ‘TextBlob’ library of the Python programming language is used for this purpose.
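A minimal sketch combining the preprocessing steps above with the NLTK and TextBlob libraries named in this study (the NLTK ‘punkt’ and ‘stopwords’ resources must be downloaded beforehand):

```python
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import Word

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())              # tokenize + lowercase
    tokens = [t for t in tokens
              if t not in string.punctuation          # drop punctuation marks
              and t not in STOP_WORDS]                # drop stop words
    return [Word(t).lemmatize() for t in tokens]      # reduce words to lemmas

print(preprocess("These beans made some really tasty banana bread!"))
```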
In this study, 3 different approaches are used to perform the review classification task with each
of them combined with 4 different classification algorithms. This provides 12
combinations of results. Moreover, each approach in itself contains multiple sets of features
which are also compared in terms of accuracy. This section discusses each approach in detail.
In the first approach, the vectorized features will be used to perform classification. As described
in Chapter 2 of this thesis, the corpus of documents is represented in a matrix format with
rows denoting the documents and columns denoting the unique words of the corpus. The values of
matrix elements can be term frequency or the combination of term frequency and inverse document
frequency, depending upon the approach. Moreover, applying dimensionality reduction technique
27
such as Latent Semantic Analysis (LSA) has proved to capture the underlying semantic
relationship between words better than the original matrix (Landauer et al., 1998). Table 4-2 describes the vectorized features used in this approach.
Table 4-2: Vectorized features used in the first approach

- Uni-gram count vectorizer: matrix with word frequency as an element and a single word as a feature
- Uni-gram + bi-gram count vectorizer: matrix with word frequency as an element and both single and paired words as features
- Uni-gram TF-IDF vectorizer: matrix with a combination of word frequency and inverse document frequency as an element and a single word as a feature
- Uni-gram + bi-gram TF-IDF vectorizer: matrix with a combination of word frequency and inverse document frequency as an element and both single and paired words as features
- Uni-gram TF-IDF vectorizer with LSA: uni-gram TF-IDF vectorizer with dimensionality reduction
- Uni-gram + bi-gram TF-IDF vectorizer with LSA: uni-gram + bi-gram TF-IDF vectorizer with dimensionality reduction
In the second approach, the set of features include text-based features such as structural features,
syntactic features, features quantifying sentiment behind the text, and meta-data. Both the review
content and review summary content are considered for the feature extraction process. The idea in
this approach is to find the set of significant features in addition to performing review
classification. Text preprocessing is not performed while extracting the structural and syntactic
features as it defies the whole purpose of extracting these features. For example, syntactic features
include the count of punctuation marks and uppercase words. However, text preprocessing is
performed while extracting semantic features as the idea is to remove noisy words which do not
contribute much to the meaning of the text. Table 4-3 describes the features used in this approach.
Here, polarity denotes whether the text is negative, positive or neutral, whereas subjectivity denotes how subjective or objective the text is.
The third and final approach involves the use of word embedding based features. In this approach,
a text document is presented in a vector form by performing vector arithmetic on words it consists
of. The words are converted into vectors by using Word2vec algorithm which employs a shallow
neural network for training the word vectors (Mikolov et al., 2013). In this approach, a combination
of word embedding matrix and TF-IDF matrix is also tried. The idea behind using this combination
is that Word2vec only gives the position of a word in a vector space but does not denote how
important the word is in a particular document. The two variants of the Word2vec algorithm,
namely CBOW (Continuous bag of words) and skip-gram, are also compared in this study. Table 4-4 describes the features used in this approach.
It is imperative to scale the features once they are extracted. Since SVM is a distance-based method,
Hsu et al. (2003) recommend scaling features to the range [-1,1] or [0,1] in order to avoid the
domination of features with greater numeric ranges over the ones with smaller numeric ranges.
Scaling also helps in achieving faster optimization in the case of logistic regression as it uses the
gradient descent method for maximizing the likelihood function. In this study, min-max
normalization is used to scale the features in the range of [0,1]. Sparse features such as vectorized
features are not scaled as the sparsity is lost due to scaling (Pedregosa et al., 2011).
This section discusses model validation using the train and test sets and sheds light on the performance measures used for evaluating the models.
This study uses the hold-out method to split the original dataset into train and test sets. The dataset
is shuffled and divided into train and test sets with the ratio 7:3 for all the approaches used in the
study. The models are trained using the samples of the train set and their performance is measured by
applying the trained models on the unseen samples of the test set. The simple hold-out method was
preferred over the more sophisticated cross-validation method such as k-fold validation due to
computational constraints. The hold-out method was chosen with the assumption that the sample
size of the train set is sufficiently large to avoid the overfitting problem.
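A minimal sketch of this split with scikit-learn, where X and y stand for the extracted features and the binary response variable:

```python
from sklearn.model_selection import train_test_split

# 70/30 shuffled hold-out split, as used for all approaches in this study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)
```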
This study deals with the binary classification of the review texts. Hence, the potential performance
measures for this study include accuracy, precision, sensitivity, specificity, and area under the ROC
curve. The outcomes of a binary classification can be evaluated using the confusion matrix, which is shown in Table 4-6.
Table 4-6: Confusion matrix for a binary classifier

Actual class      Predicted as positive     Predicted as negative
Positive          True positive (tp)        False negative (fn)
Negative          False positive (fp)       True negative (tn)
This explanation of the parameters provides a basis for understanding the evaluation purposes of
different performance measures used for performing binary classification. Table 4-7 lists and describes these measures.
Since the goal of this study is to extract and label reviews based on their usefulness, it is not
acceptable to misclassify the reviews of either class into the opposite class. Moreover, it is also
imperative to evaluate the overall performance of the classification models. Hence, the
performance measures used in this study are overall accuracy and AUC (area under the curve).
Table 4-7: Performance measures for a binary classifier (from Sokolova et al., 2009)
As it can be inferred from Table 4-7, the accuracy measure is defined as the ratio of correctly
classified samples to the total number of samples. It denotes the overall performance of the
classification model irrespective of the individual classes. Talking about the area under the ROC
curve, it tries to strike the balance between 2 performance measures namely, true positive rate and
false positive rate (James et al., 2013). True positive rate denotes the proportion of correctly
identified as positives among all positives, whereas false positive rate denotes the proportion of
incorrectly identified as positives among all negatives. The functionality of the ROC curve can be
Figure 4-5: ROC (Receiver Operating Characteristics) curve (from James et al., 2013)
In the ROC curve, the performance of a classifier is measured by the area under the curve. As can
be seen in Figure 4-5, increasing the true positive rate also increases the false positive rate. Hence,
the curve should bend toward the top left corner for an ideal classifier (James et al., 2013). This helps
in increasing the true positive rate without increasing the false positive rate much. It also means
that the larger the area under the curve, the better the performance of the classifier. The ROC curve
also helps in choosing the optimal threshold, as the threshold can be varied along the curve to obtain
the trade-off between the true positive rate and false positive rate suited to the problem at hand.
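A minimal sketch of computing the two chosen measures with scikit-learn; clf stands for any fitted classifier that exposes predict_proba (for margin-based models such as a linear SVM, decision_function scores can be passed to roc_auc_score instead):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = clf.predict(X_test)                # hard class labels (0/1)
y_prob = clf.predict_proba(X_test)[:, 1]    # probability of class '1' (Helpful)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))
```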
Chapter 5
Analysis and Results
This section discusses the analysis performed using the vectorized features. In this approach, the
features are obtained by representing the review text into a matrix form where rows denote the
unique reviews and columns denote the unique words used in the corpus of review texts. Two
variants of this matrix are used to perform the analysis: the count vectorizer and the TF-IDF
vectorizer. In the count vectorizer, a matrix element denotes the frequency with which a word appears
in a particular review, whereas an element of the TF-IDF matrix denotes the value obtained by
multiplying the term frequency and inverse document frequency values of a word for that particular
review. Moreover, this analysis also involves considering a single word (unigram) or a pair of
consecutive words (bigram) as a feature. Bigrams help in considering the context words, which could prove beneficial in uncovering the underlying semantic
relationships between the words. In this study, both unigrams and a combination of unigrams and
bigrams are analyzed and compared in terms of classification performance. Moreover, a TF-IDF
matrix obtained with a dimensionality reduction technique such as Latent Semantic Analysis
(LSA) is also included in this analysis, which in the past has proven to be more efficient compared to the original high-dimensional matrix (Landauer et al., 1998).
The first step in the analysis is to perform text preprocessing in order to reduce dimensions of the
matrix by removing noisy words. The next step is to generate the train and test sets using random
shuffling with the ratio 7:3. This is followed by generating matrices in which matrix forms of
review texts are obtained using the ‘scikit-learn’ library of the Python programming language. Moreover,
this library is also used throughout this study to implement classification algorithms along with
the hyperparameter tuning. Features are not scaled for sparse matrices as sparsity is lost after
scaling. However, it is performed when a reduced dimensional matrix is obtained using Latent
Semantic Analysis. Table 5-1 shows the hyperparameters used for the classification algorithms.
Table 5-1: Classifiers and their hyperparameters for vectorized features

- Multinomial naïve Bayes: additive smoothing parameter = 1
- Logistic regression: regularization parameter C = 1, penalty norm = L1 (L2 when LSA is applied)
- Support vector machine: kernel = linear, regularization parameter C = 1, penalty norm = L1 (L2 when LSA is applied)
- Extremely randomized trees: number of trees = 10, features considered while splitting a node = √(total number of features)
Multinomial naïve Bayes classifier calculates the probability of a class given a document by
multiplying the posterior probabilities of the words appearing in that class. Hence, if it is used to
calculate the probability of a word which has never appeared before in that particular class of
documents, it assigns zero probability for that word which in turn assigns zero probability for that
particular class. To avoid this, a smoothing parameter with the value 1 is added to the word count
while calculating the probability. Talking about logistic regression and support vector machine,
the regularization parameter used is 1 and the penalty norm used is ‘L-1’ norm. ‘L-1’ norm is used
as it aids in reducing the dimensionality of the matrix by providing a sparse solution with very few
non-zero coefficients (Pedregosa et al., 2011). However, ‘L-2’ norm is used when applying LSA
as the dimensionality reduction, in this case, is done by LSA itself. Classification with the extremely
randomized trees classifier involves the use of 10 decision trees, and the number of features
considered while splitting a node is calculated by taking the square root of the total number of features
used in the study. Geurts et al. (2006) found in their study that calculating the number of features
in this manner tends to provide near-optimal accuracy and good computational complexity in
classification problems. Throughout this study, the hyperparameters for Multinomial naïve Bayes
and the extremely randomized trees classifier will remain the same, whereas they will change for support
vector machine and logistic regression according to the requirements for dimensionality reduction.
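The setup described above corresponds roughly to the following scikit-learn calls; this is a sketch, not the exact code of the study (an L1-penalized LinearSVC requires dual=False, and L1-penalized logistic regression needs the liblinear solver):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import ExtraTreesClassifier

mnb = MultinomialNB(alpha=1.0)                            # smoothing parameter 1
lr = LogisticRegression(C=1.0, penalty="l1", solver="liblinear")
svm = LinearSVC(C=1.0, penalty="l1", dual=False)          # sparse L1 solution
et = ExtraTreesClassifier(n_estimators=10, max_features="sqrt")
```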
Now, let us first discuss the analysis performed using the count vectorizer matrix. Table 5-2 shows the performance evaluation of the count vectorizer matrices.
As it can be seen from Table 5-2, extremely randomized trees classifier performs better than the
conventional classifiers used for text classification in the case of count vectorizer with unigrams.
However, it does not perform better than the conventional classifiers in the case of high
dimensional count vectorizer containing both unigrams and bigrams. Moreover, the overall accuracy improves when bigrams are combined with unigrams.
Table 5-3 evaluates the performance of TF-IDF vectorizers. As in the case of count vectorizer,
extremely randomized trees classifier performs better than the conventional classifiers when only
unigrams are used but fails to do so when bigrams are combined with unigrams. Moreover,
combining bigrams with unigrams improves overall accuracy irrespective of the classifier used.
Uni-gram + bi-gram TF-IDF vectorizer with LSA: Accuracy = 51.37%, 57.42%, 57.59%, 69.98%; AUC = 0.5111, 0.5733, 0.5751, 0.6994 (one value per classifier; the best values belong to the extremely randomized trees classifier, as discussed below).
As it can be inferred from Table 5-4, extremely randomized trees classifier outperforms the
conventional classifiers when LSA is applied on TF-IDF vectorizer. Moreover, accuracy does not
improve when the TF-IDF vectorizer is used along with LSA. There is also no notable difference between
the performances of the TF-IDF vectorizer containing only unigrams and the one that contains both unigrams and bigrams.
Figure 5-1: Performance comparison of the vectorized features across the four classifiers (ET, SVM, LR, MNB).
Figure 5-1 compares the performances of vectorized features. The score is calculated by taking the
average of overall accuracy and AUC measures. As it can be inferred, extremely randomized trees
classifier consistently performs well irrespective of the vectorizer used. Moreover, it outperforms
the conventional classifiers used for text classification when applied to matrices with
comparatively lower dimensions such as vectorizers with LSA and unigram vectorizers. However,
conventional classifiers perform as well as the extremely randomized trees classifier when it comes
to matrices with larger dimensions such as matrices with a combination of unigrams and bigrams.
This section outlines the analysis performed using the review and summary centric features which
include structural features, syntactic features, semantic features, and meta-data. The idea behind
performing this analysis is to come up with the list of features that contribute significantly towards predicting the review helpfulness.
Talking about the analysis procedure, the features are extracted for both the review text and
summary text. For this purpose, 26 observations which do not contain any review summary text
are not taken into consideration. There are also some important points, discussed below, that need to be considered:
1) Text preprocessing is not performed while extracting the structural and syntactic features, as it defies the purpose of extracting these features.
2) Text preprocessing is, however, performed while extracting the semantic features, as the idea is to discard words that do not contribute much to understanding the semantics of the review text.
3) Syntactic features are not extracted for the review summary as there is very little scope due to its limited content compared to that of the review text.
Python programming language’s ‘NLTK’ and ‘TextBlob’ libraries are used to perform stop words
removal and lemmatization respectively. Moreover, ‘TextBlob’ library is also used to quantify the
semantic features extracted from both the review and review summary texts. The semantic features
used in this study are polarity and subjectivity of the review and summary texts. ‘TextBlob’
quantifies polarity in the range [-1,1] where ‘-1’ being ‘extremely negative’, ‘0’ being neutral, and
‘1’ being ‘extremely positive’. Subjectivity is quantified in the range [0,1] where ‘0’ being ‘highly
objective’ and ‘1’ being ‘highly subjective’. The next step is to generate train and test sets with the
ratio 7:3 with random shuffling. Features are then scaled using min-max scaling for all algorithms
Table 5-5 outlines the classification algorithms and their hyperparameters used to perform this
analysis.
Table 5-5: Classifiers and their hyperparameters for review and summary centric features
Classifier Hyperparameters
In this analysis, the feature selection for support vector machine and logistic regression is
performed using a model-based transformer (scikit-learn’s ‘SelectFromModel’). The transformer uses the model’s characteristics such as the regularization parameter and penalty function to
select significant features. To aid with the feature selection process, regularization parameter is set
to be 0.01 in order to avoid overfitting by reducing the penalty of misclassification and L-1 norm
is used as it helps in reducing the dimensionality of the features by producing sparse estimated
coefficients (Pedregosa et al., 2011). The feature selection is performed by discarding the features
which have their coefficients below a threshold value. The threshold value used in this analysis
is 1e-5.
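A sketch of this selection step, assuming the scikit-learn SelectFromModel transformer mentioned above:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# L1-penalized linear SVM with C = 0.01; features whose coefficients fall
# below the 1e-5 threshold are discarded
selector = SelectFromModel(
    LinearSVC(C=0.01, penalty="l1", dual=False), threshold=1e-5)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```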
For extremely randomized trees classifier, the feature importance is calculated by combining 2
measures: the fraction of samples a feature splits across all the trees and its contribution to the reduction of impurity at those splits.
Table 5-6 compares the performances of the classification algorithms and also shows the important features identified by each of them.
As it can be inferred from Table 5-6, extremely randomized trees classifier outperforms the
conventional classifiers used for text classification. Talking about feature importance, the common
important features for all algorithms are review polarity, review subjectivity and summary
character count. Moreover, extremely randomized trees classifier also identifies review word
count, review character count, and average word length of the review as important features. Even
though SVM and logistic regression count review rating as an important feature, it is ranked last among their selected features. (Although the extremely randomized trees classifier quantifies importance for every feature, only the top 6 are shown in Table 5-6.)
This section discusses the analysis of the classification performed with the help of word-
embedding based features. This analysis includes 2 variants of Word2vec algorithm namely,
CBOW (Continuous bag of words) and skip-gram. Moreover, Word2vec is also combined with
the TF-IDF vectorizer to check whether it performs more efficiently than the Word2vec or TF-
IDF vectorizer alone. The ‘Gensim’ library of the Python programming language is used for converting
the words into vectors. The dimensionality of the word vectors is a problem-specific decision. However, it was claimed that increasing vector dimensionality
improves performance up to a point, after which the improvement diminishes (Mikolov et al., 2013b).
Moreover, there is often some computational complexity associated with training the Word2vec
algorithm. In order to update the output word vector, one needs to iterate over every word in the
vocabulary for each training instance (Rong X., 2014). To deal with this problem, Mikolov et al.
(2013) introduced the negative sampling technique to choose a limited number of samples during
each iteration to update the output word vector. Experiments have shown that the number
of these chosen samples can be 2 to 5 for large datasets and 5 to 20 for small datasets.
study of analogical reasoning, Mikolov et al. (2013a) found that the process of selecting 15 samples
performed slightly better than the one with 5 samples. Moreover, the negative sampling technique
as a whole was found to perform better than its counterpart – hierarchical softmax.
The selection of hyperparameters for this study is based on the factors discussed above. Table 5-7 lists the chosen values.
Table 5-7: Hyperparameters used for training the Word2vec models

Hyperparameter                               Value
Dimensionality of word vector                300
Number of context words considered           5
Minimum word frequency required              5
Sample size used for negative sampling       15
The first step of the analysis procedure involves performing text preprocessing, followed by
generation of the train and test sets with a 7:3 split ratio. In order to train the model, the train set
containing the review texts is converted into a list-of-lists format where each inner list includes the
tokens of an individual review text. The next step is to build the model using the hyperparameters shown
in Table 5-7. As can be inferred from Table 5-7, the word vectors contain 300 elements after
the model is trained. Moreover, 5 context words are considered in predicting the current word in
the CBOW variant of the model, and in predicting the context words from the current word in the
skip-gram variant. To deal with the computational complexity problem, words with fewer than 5
occurrences are discarded, and only 15 negative samples are considered while updating the weights
of the output vector.
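A minimal sketch of this training step with Gensim, under the hyperparameters of Table 5-7, might look as follows; the variable 'tokenized_reviews' is an assumed name for the list-of-lists of tokens built from the training reviews.
#A minimal sketch of training the two Word2vec variants with Gensim, using the
#hyperparameters of Table 5-7; 'tokenized_reviews' is an assumed input name
from gensim.models import Word2Vec

cbow = Word2Vec(tokenized_reviews, size=300, window=5, min_count=5,
                negative=15, sg=0)       #sg=0 selects the CBOW variant
skipgram = Word2Vec(tokenized_reviews, size=300, window=5, min_count=5,
                    negative=15, sg=1)   #sg=1 selects the skip-gram variant
#Note: Gensim 4.x renames the 'size' parameter to 'vector_size'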
Two approaches are used in this study to obtain a vector for an individual review text. In
the first approach, the word vectors of a review text are added and their average is taken
to obtain a vector for the whole review text. The second approach multiplies each word vector
by its corresponding TF-IDF value and then takes the average to determine the vector for the review
text. This procedure is repeated for all the review texts of the training set, and a matrix is obtained
whose rows correspond to the reviews and whose columns correspond to the elements of the review vectors.
This matrix, along with the response variable, provides the basis for performing review helpfulness
prediction by applying machine learning classification algorithms. The columns of this matrix are
scaled using min-max scaling for all algorithms except extremely randomized trees. Table 5-8
lists the classifiers and their hyperparameters.
Table 5-8: Classifiers and their hyperparameters for word embedding based features
Classifier                    Hyperparameters
Multinomial naïve Bayes       Additive smoothing parameter = 1
Logistic regression           Regularization parameter C = 1, penalty norm = L2
Support vector machine        Kernel = linear, regularization parameter C = 1, penalty norm = L2
Extremely randomized trees    Number of trees = 10, features considered while splitting a node = √(total number of features)
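A compact sketch of the two averaging schemes described above is shown below. Here 'w2v' stands for the trained word-to-vector mapping and 'tfidf_weight' for a word-to-IDF-weight mapping; both names are assumptions standing in for the objects built in the appendix code, where the same logic is implemented as scikit-learn compatible transformer classes.
#A minimal sketch of the two schemes for turning a tokenized review into one
#vector: plain averaging, and TF-IDF weighted averaging; 'w2v' and
#'tfidf_weight' are assumed mappings (see the appendix code)
import numpy as np

def mean_vector(tokens, w2v, dim=300):
    vecs = [w2v[w] for w in tokens if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def tfidf_mean_vector(tokens, w2v, tfidf_weight, dim=300):
    vecs = [w2v[w] * tfidf_weight[w] for w in tokens if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)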
Table 5-9 outlines the performance of the first approach. As can be inferred from Table 5-9, the
extremely randomized trees algorithm outperforms the conventional algorithms used for the text
classification purpose. The conventional algorithms perform only slightly better than what would have
been achieved with a random guess. However, there is no significant difference as far as the
performances of the features are concerned; the CBOW variant of Word2vec performs virtually the
same as the skip-gram variant.
Now, let us discuss the second approach. Table 5-10 outlines its performance.
As can be inferred from Table 5-10, the extremely randomized trees algorithm again outperforms
the conventional algorithms. Moreover, there is not much difference in terms of performance
between the CBOW and skip-gram variants. It should also be noted that combining word vectors with
TF-IDF values does not help much in improving the overall prediction performance.
This section compares the performances of all the features and classifiers used throughout the
study. This comparison provides a basis for determining the most feasible combination of feature
set and classifier for building an automated text classification system for predicting the
helpfulness of online reviews.
Figure 5-2 shows the performance comparison of all the feature sets used throughout this study.
The X-axis denotes the performance score of a feature set, which is calculated by taking the
average of the overall accuracy and AUC (area under the ROC curve) measures across all the 4
classifiers used in the study. The formula for calculating the score is shown below:

Score = (1/4) × Σᵢ₌₁⁴ [(Accuracyᵢ + AUCᵢ) / 2], where i indexes the four classifiers.
As can be inferred from Figure 5-2, the overall best performance is achieved by the vectorized
features which incorporate both unigrams and bigrams, followed by the vectorized features with only
unigrams. The other feature sets perform only slightly better than what would have been obtained by
random guessing, with the LSA-based vectorizers and the review and summary centric features
performing marginally better than the word embedding based features. Now, let us discuss the
performance of the classifiers.
Figure 5-3 shows the performance comparison of all the classifiers used in the study. The
performance score is calculated by taking the average of the overall accuracy and AUC measures
across all the feature sets.
As can be inferred from Figure 5-3, the extremely randomized trees classifier performs consistently
well irrespective of the features used. In most of the cases, it outperforms the conventional
classifiers used for text classification, except in the case of vectorized features which include both
unigrams and bigrams. This justifies the popularity of conventional classifiers in
high-dimensional setups. The difference between the performances of extremely randomized trees
and the conventional classifiers is also marginal for the vectorized features with only unigrams.
Chapter 6
Conclusions and Future work
In this study, the focus was on building an automated text classification system which can predict
the helpfulness measure of an online review irrespective of the time of posting. The purpose of
this was to provide both consumers and manufacturers a wide variety of reviews to choose from
by including the most recent yet unvoted reviews in addition to higher voted old dated reviews.
The first step was to conduct a literature survey in order to get familiar with the work that had been
done in the domain of text classification including the work relevant to the review helpfulness
measurement. It was found that the work done regarding the review helpfulness focused mostly
on finding a correlation between review content-centric features and the helpfulness measure of
the review. It was also noticed that word embedding based features had been used in text
classification domains other than review helpfulness. In this study, a comparison was made
between the vectorized features, the review and review summary centric features, and the word
embedding based features which, to the best of my knowledge, had not been attempted before.
Moreover, the conventional classifiers used for text classification such as Support vector machine,
Multinomial naïve Bayes, and Logistic regression were compared with Extremely randomized
trees - a decision tree-based ensemble classifier. In order to perform the prediction task using the
above discussed features and classifiers, the binary response variable was generated on the basis
of helpfulness votes obtained by a review. It was followed by text preprocessing and finally, the
models were trained and tested using hold-out validation along with accuracy and AUC scores.
The major findings of this study are summarized below:
• Extremely randomized trees classifier performed consistently well irrespective of the
feature set used. In most of the instances, it outperformed the conventional classifiers used
for text classification except in the case of vectorized features with unigrams and bigrams.
• Conventional classifiers performed better with sparse features such as the count vectorizer and
TF-IDF vectorizer with unigrams and bigrams compared to the features with lower
dimensionality. With these sparse features they even outperformed the extremely randomized trees
classifier, which justifies the claims made by Joachims (1998) and Aggarwal and Zhai (2012)
regarding the high suitability of conventional classifiers such as SVM for sparse text
classification problems.
• Among the features used, the overall best performance was obtained when the classifiers
were implemented using the vectorized features with unigrams and bigrams, followed by
the vectorized features with only unigrams; reducing the dimensionality of the
vectorized features using LSA did not help much. Moreover, the results obtained using the
other features were only slightly better than those obtained by random guessing, with the
summary and review centric features performing marginally better than the word embedding
based features. Also, there was no difference between the performances of the CBOW and
skip-gram variants of the Word2vec algorithm.
• Regarding the review and summary centric features, the common significant features
identified by all classifiers were review polarity, review subjectivity, and summary
character count. Even though the review score rating was identified as a significant feature
by SVM and Logistic regression, it was ranked last in feature importance by the more accurate
Extremely randomized trees classifier. This complements the result obtained while
performing exploratory data analysis. The other significant features identified by the
Extremely randomized trees classifier were review character count, review word count,
and average word length of the review. These features, along with review polarity and review
subjectivity, complement the results obtained in the study conducted by Kim et al. (2006).
This research provides ample scope for potential future work and can be extended in multiple
directions. First of all, the validation procedure can be made more robust. Even though there was
enough data for training in the case of the hold-out validation method, the likelihood of overfitting
can be further reduced by using k-fold cross-validation, as sketched below. The overall best
performance was achieved by the vectorized features with unigrams and bigrams. It can be checked
whether the performance improves when trigrams are added to the combination of unigrams and
bigrams, especially in the case of conventional classifiers, which are usually suitable for
high-dimensional sparse text classification problems. Moreover, the overall best performing
classifier was Extremely randomized trees, which is an ensemble of decision tree classifiers. It can
be compared with other decision tree-based ensemble classifiers such as Random forest and Gradient
tree boosting to examine whether the performance can be enhanced. Word embedding based features
performed only marginally better than random guessing. Hence, there is a need to implement a more
sophisticated technique for converting review text into a vector form. The Paragraph Vector
algorithm proposed by Le and Mikolov (2014) directly provides a document vector without
manipulating the document's word vectors. This algorithm can be considered for future study as it
has been shown to perform well on text classification tasks.
Appendix
Python code
import pandas as pd
import numpy as np

reviews = pd.read_csv("Reviews.csv")

#Removing reviews which have 'helpfulness denominator' value 0
rs = reviews[reviews.HelpfulnessDenominator != 0].copy()

#Generating binary response variable
rs["Review_Helpfulness"] = np.where(rs['HelpfulnessNumerator'] > 1, 1, 0)

#Converting uppercase into lowercase
rs['Text'] = rs['Text'].apply(lambda x: " ".join(w.lower() for w in x.split()))

#Removing punctuation marks
rs['Text'] = rs['Text'].str.replace(r'[^\w\s]', '', regex=True)

#Removing stopwords (requires a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop += ['.', '/', '<', '>', '``', '"', '-', '--']
rs['Text'] = rs['Text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))

#Performing lemmatization
from textblob import Word
rs['Text'] = rs['Text'].apply(lambda x: " ".join(Word(w).lemmatize() for w in x.split()))

#Generating train and test sets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(
    rs['Text'], rs['Review_Helpfulness'], test_size=0.3, shuffle=True, random_state=42)

#Logistic regression with count vectorizer (unigrams and bigrams)
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
clflr = make_pipeline(
    CountVectorizer(stop_words='english', analyzer='word', ngram_range=(1, 2)),
    LogisticRegression(C=1, dual=False, penalty='l1', solver='liblinear',
                       random_state=42))  #liblinear supports the L1 penalty
clflr.fit(train_x, train_y)
predicted = clflr.predict(valid_x)
print(accuracy_score(valid_y, predicted))
print(roc_auc_score(valid_y, predicted))
#Extremely randomized trees with count vectorizer (unigrams and bigrams)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
clfetree = make_pipeline(
    CountVectorizer(stop_words='english', analyzer='word', ngram_range=(1, 2)),
    ExtraTreesClassifier(n_estimators=10, random_state=42))
clfetree.fit(train_x, train_y)
predicted = clfetree.predict(valid_x)
print(accuracy_score(valid_y, predicted))
print(roc_auc_score(valid_y, predicted))

#Extremely randomized trees with TF-IDF vectorizer (unigrams)
clfetree_tfidf = make_pipeline(
    TfidfVectorizer(stop_words='english', analyzer='word', ngram_range=(1, 1), norm=None),
    ExtraTreesClassifier(n_estimators=10, random_state=0))
clfetree_tfidf.fit(train_x, train_y)
predicted = clfetree_tfidf.predict(valid_x)
print(accuracy_score(valid_y, predicted))
print(roc_auc_score(valid_y, predicted))
import pandas as pd
import numpy as np

reviews = pd.read_csv("Reviews.csv")

#Removing reviews which have 'helpfulness denominator' value 0
rs = reviews[reviews.HelpfulnessDenominator != 0].copy()

#Generating binary response variable
rs["Review_Helpfulness"] = np.where(rs['HelpfulnessNumerator'] > 1, 1, 0)

#Checking missing values for each attribute
rs.isna().sum()

#Removing null values of 'Summary' attribute
rsnew = rs.dropna()

#Selecting the review and summary centric features. The beginning of this
#column list was lost in extraction; the leading names are assumed from the
#features discussed in the study ('features_df' and 'y' are built in code
#not preserved here)
X = features_df[['Score', 'review_word_count', 'review_char_count',
                 'review_avg_word_length', 'review_uppercaseword_count',
                 'summary_word_count', 'summary_char_count', 'polarity',
                 'subjectivity', 'summary_polarity', 'summary_subjectivity']]

#Generating train and test sets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

#Scaling the features with min-max scaling
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler(copy=False)
train_scaled = min_max_scaler.fit_transform(train_x)
valid_scaled = min_max_scaler.transform(valid_x)

#L1-based feature selection with logistic regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
clflr = LogisticRegression(C=0.01, penalty='l1', dual=False,
                           solver='liblinear', random_state=0).fit(train_scaled, train_y)
model = SelectFromModel(clflr, prefit=True)
X_new = model.transform(train_scaled)
selected_feature_names = X.columns[model.get_support()]
print(X_new.shape)
print(model.get_support())
print(selected_feature_names)
#Retraining logistic regression using the selected features
LR_new_features = features_df[['Score', 'review_word_count', 'summary_char_count',
                               'polarity', 'subjectivity', 'summary_polarity']]
train_x, valid_x, train_y, valid_y = train_test_split(
    LR_new_features, y, test_size=0.3, shuffle=True, random_state=42)
min_max_scaler = MinMaxScaler(copy=False)
train_scaled = min_max_scaler.fit_transform(train_x)
valid_scaled = min_max_scaler.transform(valid_x)
clflr = LogisticRegression(C=0.01, penalty='l1', dual=False,
                           solver='liblinear', random_state=42).fit(train_scaled, train_y)
model = SelectFromModel(clflr, prefit=True)
X_new = model.transform(train_scaled)
selected_feature_names = LR_new_features.columns[model.get_support()]
predicted = clflr.predict(valid_scaled)
accuracy = np.mean(predicted == valid_y)
print(accuracy)
print(roc_auc_score(valid_y, predicted))
print(clflr.coef_)
print(model.get_support())
print(selected_feature_names)
import pandas as pd
import numpy as np

reviews = pd.read_csv("Reviews.csv")

#Removing reviews which have 'helpfulness denominator' value 0
rs = reviews[reviews.HelpfulnessDenominator != 0].copy()

#Generating binary response variable
rs["Review_Helpfulness"] = np.where(rs['HelpfulnessNumerator'] > 1, 1, 0)

#Converting uppercase into lowercase
rs['Text'] = rs['Text'].apply(lambda x: " ".join(w.lower() for w in x.split()))

#Removing punctuation marks
rs['Text'] = rs['Text'].str.replace(r'[^\w\s]', '', regex=True)

#Removing stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop += ['.', '/', '<', '>', '``', '"', '-', '--']
rs['Text'] = rs['Text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))

#Performing lemmatization
from textblob import Word
rs['Text'] = rs['Text'].apply(lambda x: " ".join(Word(w).lemmatize() for w in x.split()))

#Generating train and test sets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(
    rs['Text'], rs['Review_Helpfulness'], test_size=0.3, shuffle=True, random_state=42)
import matplotlib.pyplot as plt
#Note: the word embedding code below is inspired by the work of Drozd (2016)
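The Gensim training step that produces the word-to-vector mapping w2v used below appears to be missing from the extracted code; a minimal reconstruction, assuming the hyperparameters of Table 5-7, is:
#Assumed reconstruction of the missing Word2vec training step; hyperparameter
#values follow Table 5-7 (Gensim 4.x renames 'size' to 'vector_size')
from gensim.models import Word2Vec
tokenized_reviews = [text.split() for text in train_x]
model = Word2Vec(tokenized_reviews, size=300, window=5, min_count=5,
                 negative=15, sg=0)  #sg=0 for CBOW, sg=1 for skip-gram
w2v = dict(zip(model.wv.index2word, model.wv.vectors))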
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        if len(word2vec) > 0:
            self.dim = len(word2vec[next(iter(self.word2vec))])
        else:
            self.dim = 0

    #'fit' and 'transform' restored following Drozd (2016); reviews are stored
    #as space-joined strings, hence the split() before averaging
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words.split() if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
#Logistic regression with mean word embedding features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
lr = Pipeline([("word2vec vectorizer", MeanEmbeddingVectorizer(w2v)),
               ("Min_max_Scaler", MinMaxScaler(copy=False)),
               ("LogisticReg", LogisticRegression(C=1, penalty='l2',
                                                  dual=False, random_state=42))])
lr.fit(train_x, train_y)
predicted = lr.predict(valid_x)
print(accuracy_score(valid_y, predicted))  #accuracy
print(roc_auc_score(valid_y, predicted))   #AUC
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        if len(word2vec) > 0:
            self.dim = len(word2vec[next(iter(self.word2vec))])
        else:
            self.dim = 0

    #'fit' and 'transform' restored following Drozd (2016): 'fit' learns IDF
    #weights (unseen words default to the maximum IDF), 'transform' averages
    def fit(self, X, y=None):
        tfidf = TfidfVectorizer(analyzer=lambda s: s.split())
        tfidf.fit(X)
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] * self.word2weight[w]
                     for w in words.split() if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

#Logistic regression with TF-IDF weighted word embedding features
lr = Pipeline([("word2vec vectorizer", TfidfEmbeddingVectorizer(w2v)),
               ("Min_max_Scaler", MinMaxScaler(copy=False)),
               ("LogisticReg", LogisticRegression(C=1, penalty='l2',
                                                  dual=False, random_state=42))])
lr.fit(train_x, train_y)
predicted = lr.predict(valid_x)
print(accuracy_score(valid_y, predicted))  #accuracy
print(roc_auc_score(valid_y, predicted))   #AUC
References
Aggarwal, C.C. and Zhai, C., 2012. A survey of text classification algorithms.
In Mining text data (pp. 163-222). Springer, Boston, MA.
Cao, Q., Duan, W. and Gan, Q., 2011. Exploring determinants of voting for
the “helpfulness” of online user reviews: A text mining approach. Decision
Support Systems, 50(2), pp.511-521.
Crawford, M., Khoshgoftaar, T.M., Prusa, J.D., Richter, A.N. and Al Najada,
H., 2015. Survey of review spam detection using machine learning
techniques. Journal of Big Data, 2(1), p.23.
Drozd, N., 2016. Text Classification with Word2Vec – DS lore. Retrieved
from http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-
word2vec/
Geurts, P., Ernst, D. and Wehenkel, L., 2006. Extremely randomized
trees. Machine learning, 63(1), pp.3-42.
Ghose, A. and Ipeirotis, P.G., 2007, August. Designing novel review ranking
systems: predicting the usefulness and impact of reviews. In Proceedings of
the ninth international conference on Electronic commerce (pp. 303-310).
ACM.
Hotho, A., Nürnberger, A. and Paaß, G., 2005, May. A brief survey of text
mining. In Ldv Forum (Vol. 20, No. 1, pp. 19-62).
Hsu, C.W., Chang, C.C. and Lin, C.J., 2003. A practical guide to support
vector classification.
James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An introduction to
statistical learning (Vol. 112, p. 18). New York: Springer.
Joachims, T., 1998, April. Text categorization with support vector machines:
Learning with many relevant features. In European conference on machine
learning (pp. 137-142). Springer, Berlin, Heidelberg.
Kim, S.M., Pantel, P., Chklovski, T. and Pennacchiotti, M., 2006, July.
Automatically assessing review helpfulness. In Proceedings of the 2006
Conference on empirical methods in natural language processing (pp. 423-
430). Association for Computational Linguistics.
Landauer, T.K., Foltz, P.W. and Laham, D., 1998. An introduction to latent
semantic analysis. Discourse processes, 25(2-3), pp.259-284.
Le, Q. and Mikolov, T., 2014, January. Distributed representations of
sentences and documents. In International conference on machine
learning (pp. 1188-1196).
Lilleberg, J., Zhu, Y. and Zhang, Y., 2015, July. Support vector machines
and word2vec for text classification with semantic features. In 2015 IEEE
14th International Conference on Cognitive Informatics & Cognitive
Computing (ICCI* CC) (pp. 136-140). IEEE.
Liu, B. and Zhang, L., 2012. A survey of opinion mining and sentiment
analysis. In Mining text data (pp. 415-463). Springer, Boston, MA.
Liu, Y., Huang, X., An, A. and Yu, X., 2008, December. Modeling and
predicting the helpfulness of online reviews. In 2008 Eighth IEEE
international conference on data mining (pp. 443-452). IEEE.
Manning, C.D., Raghavan, P. and Schütze, H., 2008. Introduction to
information retrieval. Cambridge University Press.
McAuley, J.J. and Leskovec, J., 2013, May. From amateurs to connoisseurs:
modeling the evolution of user expertise through online reviews.
In Proceedings of the 22nd international conference on World Wide
Web (pp. 897-908). ACM.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013a.
Distributed representations of words and phrases and their compositionality.
In Advances in neural information processing systems (pp. 3111-3119).
Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013b. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. and Vanderplas, J.,
2011. Scikit-learn: Machine learning in Python. Journal of machine learning
research, 12(Oct), pp.2825-2830.
Rehurek, R. and Sojka, P., 2010. Software framework for topic modelling
with large corpora. In Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks. pp. 45-50. Valletta, Malta, May 2010.
ELRA. http://is.muni.cz/publication/884893/en.
Rong, X., 2014. word2vec parameter learning explained. arXiv preprint
arXiv:1411.2738.
Salehan, M. and Kim, D.J., 2016. Predicting the performance of online
consumer reviews: A sentiment mining approach to big data
analytics. Decision Support Systems, 81, pp.30-40.
Shojaee, S., Murad, M.A.A., Azman, A.B., Sharef, N.M. and Nadali, S., 2013,
December. Detecting deceptive reviews using lexical and syntactic features.
In 2013 13th International Conference on Intelligent Systems Design and
Applications (pp. 53-58). IEEE.
Sokolova, M. and Lapalme, G., 2009. A systematic analysis of performance
measures for classification tasks. Information Processing &
Management, 45(4), pp.427-437.
Thomo, A., 2009. Latent semantic analysis (Tutorial). Victoria, Canada, pp.1-7.
Vinodhini, G. and Chandrasekaran, R.M., 2012. Sentiment analysis and
opinion mining: a survey. International Journal, 2(6), pp.282-292.