
Received December 26, 2019; accepted February 1, 2020; date of publication February 10, 2020; date of current version February 18, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2972632

Classification of Shopify App User Reviews Using Novel Multi Text Features

FURQAN RUSTAM1, ARIF MEHMOOD4, MUHAMMAD AHMAD2,3, SALEEM ULLAH1, DOST MUHAMMAD KHAN4, AND GYU SANG CHOI5
1 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan
2 Department of Computer Engineering, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan
3 Dipartimento di Matematica e Informatica—MIFT, University of Messina, 98121 Messina, Italy
4 Department of Computer Science and IT, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
5 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38541, South Korea

Corresponding authors: Arif Mehmood ([email protected]) and Gyu Sang Choi ([email protected])

This work was supported in part by the Ministry of Trade, Industry and Energy (MOTIE, South Korea) through the Industrial Technology Innovation Program under Grant 10063130, and in part by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIT) under Grant NRF-2019R1A2C1006159.

The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

ABSTRACT App stores usually allow users to give reviews and ratings that are used by developers to resolve issues and make plans for their apps. In this way, app stores collect large amounts of data for analysis. However, several challenges related to redundancy and the volume of this data must first be addressed using machine learning. This study performs experiments on a dataset that contains reviews for Shopify apps. To overcome the aforementioned limitations, we first categorize user reviews into two groups, i.e., happy and unhappy, and then perform preprocessing on the reviews to clean the data. At a later stage, several feature engineering techniques, such as bag-of-words, term frequency-inverse document frequency (TF-IDF), and chi-square (Chi2), are used singly and in combination to preserve meaningful information. Finally, random forest, AdaBoost, and logistic regression models are used to classify the reviews as happy or unhappy. The performance of our proposed pipeline was evaluated using average accuracy, precision, recall, and f1 score. The experiments reveal that a combination of features can improve machine learning models' performance; in this study, logistic regression outperforms the others and achieves an 83% true acceptance rate when combined with TF-IDF and Chi2.

INDEX TERMS Feature engineering, feature extraction, feature selection, machine learning, review
classification, text mining.

I. INTRODUCTION
Manufacturers always want to know the success rate of their products/apps, and for that, they usually request users to provide feedback that is later used to analyze the impact and quality of their products [1], [2]. However, such feedback is difficult to analyze due to its volume and redundancy. Therefore, this work investigates an efficient way to analyze such feedback and solve the problems related to the classification of Shopify app reviews. Many techniques have already been proposed; for example, [3] performed topic modeling to find high-level and meaningful features using latent Dirichlet allocation to filter the irrelevant reviews and achieved 59% precision and 51% recall rates.

The work [4] built a mobile app review analyzer that automatically extracts user requests or suggestions from reviews. This analyzer works on the basis of linguistic rules to extract requests from online reviews. The work [5] presented some probabilistic techniques for classifying app reviews. They classified these reviews into four categories: ratings, bug reports, feature requests, and user experiences. They used multiple binary classifiers to classify the reviews and achieved acceptable results.


The work [6] used different machine learning algorithms to solve app review classification problems. They performed a comparative analysis of the results of machine learning algorithms such as naïve Bayes (NB), random forest (RF), decision tree (DT), support vector machine (SVM), and logistic regression (LR). In previous work, researchers tried to solve different problems arising during app review analysis. This study solves the classification problem for Shopify app reviews on the basis of the ratings given by app users and performs a comparative analysis between tree-based ensemble and linear models.

More formally, machine learning algorithms take the users' reviews as input and then perform analysis on these reviews to predict whether the users are happy or not. This work does not investigate other types of information, such as user properties, app names, and app descriptions. We intentionally limited the inputs (i.e., user reviews and rating scores) to keep the problem definition simple. We used a dataset obtained from Kaggle, on which preprocessing using the natural language toolkit (NLTK) [7] was done to clean the reviews. The preprocessing steps for cleaning the reviews included tokenization, punctuation removal, lower-case conversion, removal of numeric values, and stopword removal. Finally, a stemming technique was used to get the root form of each feature in the reviews. Rating scores from users (1 to 5) were used to create two classes, i.e., happy and unhappy, where users who gave rating scores of 3 or above were assigned to the happy class, and the rest to the unhappy class. More details regarding this step can be found in Section IV(B).

After preprocessing, two different text feature extraction techniques, bag-of-words (BoW) and term frequency-inverse document frequency (TF-IDF), were deployed to extract the high-level features. Later, the chi-square (Chi2) feature selection technique was used to select the most important and least redundant features to fit several machine learning models.

Prior to fitting the models, the data was split into two parts, i.e., training and testing, in a 70% to 30% ratio. Finally, several machine learning classifiers, i.e., the AdaBoost classifier (AC), LR, and RF, were used to classify the reviews as happy or unhappy. RF and AC are both tree-based ensemble models, while LR is a statistical method for solving classification problems [8]. Accuracy, precision, recall, and f1 score are used as performance evaluation metrics. The key points of this study are as follows:
• Categorization of happy and unhappy users on the basis of reviews and ratings
• Preprocessing techniques to clean text reviews for efficient learning of models, i.e., stemming, stopword removal, lower-case conversion, and punctuation and numeric value removal
• Feature engineering techniques, i.e., BoW, TF-IDF, and Chi2
• Machine learning models, i.e., RF, AC, and LR
• Comparative analysis of the performance of learning models with respect to feature engineering techniques

The rest of this paper is organized as follows: Section II presents related work. Section III describes the materials and methods used in this study. Section IV contains the proposed methodology, and Section V contains the results and discussion. Finally, Section VI concludes the paper with possible directions for future research.

II. RELATED WORK
As mentioned above, data classification is an area explored by many data scientists. Researchers have done much work in the text classification domain, using different approaches and introducing some new techniques in this field. In this section, we discuss previous work on app review classification and analysis.

The study [9] works on app review classification using ensemble algorithms and techniques. The dataset used in that study was previously examined in [3]; it contains reviews from Apple's App Store and the Google Play app store. In [9], the authors used NB, SVM, LR, and a neural network (NN) in various combinations for classification. They built three ensembles, A, B, and C. In ensemble A, four classifiers, NB, SVM, LR, and NN, were grouped for final prediction; in ensemble B, three classifiers, SVM, LR, and NN, were grouped; and in ensemble C, the two classifiers NB and SVM were grouped. The best performers among these individual and ensemble algorithms were LR and NN. This study also uses ensemble models, RF and AC, which combine numbers of base learners (decision trees) to make final predictions.

In another research [4], text analysis was performed for mobile app feature requests. They designed MARA (mobile app review analyzer), a prototype for automatic retrieval of mobile app feature requests from online reviews. MARA takes review content as input for feature request mining. The feature request mining algorithm uses a set of linguistic rules, which are defined to support the identification of sentences that indicate such requests. The linear discriminant analyzer model was used to identify topics that can be associated with these requests in user reviews. They used true positive (TP), false positive (FP), true negative (TN), false negative (FN), precision, recall, and Matthews correlation coefficient as evaluation metrics to check the accuracy of the algorithm.

Researchers perform analysis on app reviews to facilitate app developers in finding out whether their customers are happy or not, which is also a goal of this study. In study [10], researchers tried to help mobile app developers by performing analysis on user reviews to categorize information that is important for app maintenance and evolution. For classification purposes, they deduced a taxonomy of user review categories that are relevant to app maintenance. The authors merged three techniques: natural language processing, text analysis, and sentiment analysis. By merging these techniques, they achieved desirable results in terms of precision and recall (precision score 74% and recall score 73%). They also applied these techniques individually to classify user reviews.


In another study [11], the authors tried to extract the values of comparison scores of sentiment reviews using different feature extraction techniques, such as word2vec, word2doc, and TF-IDF, with SVM, NB, and decision tree algorithms. In study [11], the authors used grid search algorithms for parameter optimization of the machine learning algorithms and feature extraction methods.

LR performs significantly well in the case of classification, and LR is usually preferred by researchers when there is a binary classification problem. Study [12] used LR for tweet classification with different feature engineering techniques and achieved acceptable results. Similarly, the other two ensemble models that were used in this study, RF and AC, are also used in many fields of data mining. These tree-based ensemble models perform well on text data and categorical data. Study [13] used RF to predict a chemical compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. They trained RF using six cheminformatics datasets. Their analysis showed that RF is a powerful tool that gives good performance and accurate models. RF is an ensemble model that creates trees using bootstrap samples of the training data.

III. MATERIALS AND METHODS
A. DATA DESCRIPTION
This study used the Shopify app store dataset, which was obtained from Kaggle, a well-known source for benchmark datasets. The dataset contains 287467 examples and seven variables about Shopify apps. This study used two of the seven variables (reviews and ratings) for its experiments. The dataset contains reviews for different categories of apps, such as store design, finance, orders and shipping, marketing, reporting, trust and security, finding and adding products, inventory management, productivity, and places to sell. In the reviews, users express their problems and issues related to apps and also give ratings to apps from 1 to 5. Table 1 shows the variables contained in the Shopify app store dataset, and Table 2 shows examples from the dataset used in this study.

TABLE 1. Description of dataset variables.

TABLE 2. Sample of data from dataset.

The dataset contains rating scores from 1 to 5 corresponding to each review, but the rating score ratios are not equal, as shown in Table 3. Rating score 3 has the lowest number of examples in the dataset, 2552, while rating score 5 contains 246712 examples, which makes the dataset highly imbalanced. To solve this problem, we extracted 2552 examples from each rating score to balance the dataset. For the experiment, a total of 12760 examples were used in this study, i.e., 2552 examples from each rating score.

TABLE 3. Number of samples corresponding to each rating score.

B. DATASET VISUALIZATION
This section illustrates the make-up of the dataset graphically.

FIGURE 1. Apps per category.

FIGURE 2. Reviews per category.
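As a concrete illustration of the balancing step described in Section III(A), the following minimal sketch draws 2552 reviews per rating score with pandas; the file name and column names here are hypothetical stand-ins, not the dataset's actual identifiers:

import pandas as pd

# Hypothetical file and column names; the actual Kaggle dump may differ.
df = pd.read_csv("shopify_app_reviews.csv")[["review", "rating"]].dropna()

# Draw 2552 examples per rating score (the size of the smallest group,
# rating 3) so that all five ratings are equally represented: 5 x 2552 = 12760.
balanced = (
    df.groupby("rating", group_keys=False)
      .apply(lambda g: g.sample(n=2552, random_state=42))
      .reset_index(drop=True)
)

# Ratings of 3 or above are labeled happy (1), the rest unhappy (0),
# as described later in Section IV(B).
balanced["label"] = (balanced["rating"] >= 3).astype(int)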


FIGURE 3. User reviews on apps per rating score.

FIGURE 4. Dataset after extracting 2552 examples from each rating score.

C. FEATURE ENGINEERING METHODS
Feature engineering is a process for finding meaningful features in data for the efficient training of machine learning algorithms or, in other words, the creation of features derived from the original features [14]. The study [15] concludes that feature engineering can boost the performance of machine learning algorithms. "Garbage in, garbage out" is a common saying in machine learning: senseless data produces senseless output, while more informative data can produce desirable results. Therefore, feature engineering can extract meaningful features from raw data, which helps to increase the consistency and accuracy of learning algorithms. In this study, we used three feature engineering methods: BoW, TF-IDF, and Chi2.

1) BAG-OF-WORDS
BoW is a method of extracting features from text data, and it is very easy to understand and implement. BoW is very useful in problems such as language modeling and text classification. In this method, we use CountVectorizer to extract features. CountVectorizer works on term frequency, i.e., counting the occurrences of tokens and building a sparse matrix of tokens [16]. BoW is a collection of words and features, where each feature is assigned a value that represents the occurrences of that feature [17].

2) TF-IDF
TF-IDF is a feature extraction method used to extract features from data. TF-IDF is most widely used in text analysis and music information retrieval [18]. TF-IDF assigns a weight to each term in a document based on its term frequency (TF) and inverse document frequency (IDF) [12], [19]. The terms with higher weight scores are considered to be more important [20]. TF-IDF computes the weight of each term using the formula in equation 1:

$$W_{i,j} = TF_{i,j} \times \left(\frac{N}{DF_t}\right) \qquad (1)$$

Here, $TF_{i,j}$ is the number of occurrences of term $t$ in document $d$, $DF_t$ is the number of documents containing the term $t$, and $N$ is the total number of documents in the corpus.

3) CHI2
Chi2 is the most common feature selection method, and it is mostly used on text data [21]. In feature selection, we use it to check whether the occurrence of a specific term and the occurrence of a specific class are independent. More formally, for a given document $D$, we estimate the following quantity for each term and rank the terms by their score. Chi2 finds this score using equation 2:

$$X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} \qquad (2)$$

where
• $N$ is the observed frequency and $E$ the expected frequency,
• $e_t$ takes the value 1 if the document contains term $t$ and 0 otherwise,
• $e_c$ takes the value 1 if the document is in class $c$ and 0 otherwise.

For each feature (term), a high Chi2 score indicates that the null hypothesis $H_0$ of independence (meaning the document class has no influence over the term's frequency) should be rejected, i.e., the occurrence of the term and the class are dependent. In this case, we should select the feature for text classification.
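A minimal sketch of the three feature engineering methods with scikit-learn, on an invented three-review corpus, could look as follows; note that scikit-learn's TfidfVectorizer applies a smoothed logarithmic IDF by default, which differs slightly from the plain ratio in equation (1):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["app works great", "app crashes on checkout", "great support team"]
labels = [1, 0, 1]  # 1 = happy, 0 = unhappy (toy labels)

# 1) BoW: CountVectorizer builds a sparse matrix of token occurrence counts.
X_bow = CountVectorizer().fit_transform(corpus)

# 2) TF-IDF: weights each term by term frequency and inverse document
#    frequency, in the spirit of equation (1).
X_tfidf = TfidfVectorizer().fit_transform(corpus)

# 3) Chi2: scores each term against the class labels, as in equation (2),
#    and keeps only the k highest-scoring terms.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X_tfidf, labels)
print(X_selected.shape)  # (3 documents, 5 selected terms)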


D. SUPERVISED MACHINE LEARNING METHODS
In this section, we discuss the machine learning algorithms; we describe their implementation details and hyperparameters. The scikit-learn library and NLTK were used for the implementation of the machine learning algorithms [7], [22]. All three supervised machine learning algorithms mentioned below were deployed in Python using the scikit-learn module. Supervised machine learning algorithms are commonly used to solve classification and regression problems [23]. RF and AC are tree-based algorithms, while LR is a linear model. This study used three different machine learning algorithms to solve the classification problem. The implementation details of the algorithms with their hyperparameters are shown in Table 4.

TABLE 4. Machine learning algorithms and their hyperparameters.

1) RANDOM FOREST (RF)
RF is a tree-based ensemble model that produces highly accurate predictions by combining many weak learners (decision trees) [13]. This model uses the bagging technique to train a number of decision trees on different bootstrap samples [24]. In RF, a bootstrap sample is obtained by subsampling the training dataset with replacement, where the size of a sample is the same as that of the training dataset [25]. RF and other classifiers that use decision trees in their prediction procedure apply the same techniques to construct decision trees, and a major challenge in constructing them is the identification of the attribute for the root node at each level. This process is known as attribute selection [26].

In ensemble classification, several classifiers are trained, and their results are combined through a voting process. In the past, many researchers have proposed ensemble methods [27]–[29]. The most popular ensemble methods are bagging [25] and boosting [30], [31]. Bagging (or bootstrap aggregating) is a method in which many classifiers are trained on bootstrapped samples, which has been shown to reduce the variance of the classification. RF can be defined as in equations 3 and 4:

$$p = \text{mode}\{T_1(y), T_2(y), \ldots, T_m(y)\} \qquad (3)$$

$$p = \text{mode}\left\{\sum_{m=1}^{M} T_m(y)\right\} \qquad (4)$$

Here $p$ is the final prediction by majority voting of the decision trees, while $T_1(y), T_2(y), T_3(y), \ldots, T_m(y)$ are the decision trees participating in the prediction procedure. For a more detailed discussion of the RF algorithm, see [24].

RF was implemented with 300 weak learners to get high accuracy, which is the reason we set the n_estimators value equal to 300. The n_estimators parameter defines the number of trees contributing to the prediction. In the experiment, RF trained 300 decision trees on bootstrap samples, and the final prediction was made by voting among all decision tree predictions [32]. The second parameter used in RF was "max_depth" with a value of 60. This parameter sets the maximum depth level of each decision tree; it reduces complexity in the tree by limiting the depth level and reduces the chance of overfitting [25]. Another parameter used for the RF algorithm was "random_state." This parameter controls the randomness of the samples during the training of the model. By using these hyperparameters, we achieved good results with RF in this study.

2) ADABOOST CLASSIFIER (AC)
This is an ensemble learning model that uses a boosting method for training weak learners (decision trees). AdaBoost is an acronym for adaptive boosting. AC is very popular and possibly the most historically significant boosting algorithm, as it was the first algorithm that could adapt to weak learners [33]. The AC algorithm combines a number of "weak learners" and trains them recursively on copies of the original dataset, while the weak learners focus on the difficult data points or outliers [31]. It is a meta-model that takes N copies of weak learners and trains them on the same feature set but with different weights assigned to them. The study [34] eulogizes the performance of AC with mathematical ground truth. In many classification experiments, AC has been shown to outperform other machine learning algorithms [35]–[37]. For a detailed discussion of AC, see section 2 of [38].

This study implemented the AC algorithm with different hyperparameters (see Table 4) and tuned these parameters to get high accuracy. The parameter "n_estimators" was used with a value of 300, which means that the AC algorithm combined 300 weak learners to make predictions. The difference between RF and AC ensemble learning is that RF uses the bagging method, while AC uses the boosting method and is exactly the weighted combination of N weak learners. Another parameter that we used is "random_state"; this parameter defines the randomness of the samples during the training of the model. The AC classification equation can be represented as in equation 5:

$$f(v) = \text{sign}\left(\sum_{t=1}^{T} \theta_t f_t(v)\right) \qquad (5)$$

where
• $f_t$ is the $t$-th weak learner,
• $\theta_t$ is the weight of the $t$-th weak learner.

The AC model identifies outliers using high-weight data points, while the gradient boosting algorithm performs the same task using gradients in the loss function [31].

3) LOGISTIC REGRESSION (LR)
LR is a statistical method for analyzing data in which one or more variables are used to find an outcome. LR is a regression model used to estimate the probability of class membership, so it is a good learning model to use when the target variable is categorical. LR models the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function. A logistic function or logistic curve is a common "S"-shaped, or sigmoid, curve, as in equation 6:

$$f(v) = \frac{L}{1 + e^{-m(v - v_0)}} \qquad (6)$$

where
• $e$ is the natural logarithm base (also known as Euler's number),
• $v_0$ is the x-value of the sigmoid midpoint,
• $L$ is the curve's maximum value,
• $m$ is the steepness of the curve.

For values of $v$ in the domain of real numbers from $-\infty$ to $+\infty$, the S-curve of the logistic function is obtained, with the graph of $f$ approaching $L$ as $v$ approaches $+\infty$ and approaching zero as $v$ approaches $-\infty$. This study used the liblinear algorithm for optimization because it works well on small datasets, whereas "sag" and "saga" are faster for large ones. The second parameter used with LR is "multi_class," and we used it with the "ovr" value because it is good for binary classification. The third parameter is "C." This is the inverse regularization parameter, which modifies the strength of regularization by being inversely positioned to the lambda regulator, and it reduces the chance of overfitting the model [39]. LR was used in this study because it is well suited to binary classification and is also effective for categorizing text [3], [40].
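Under the hyperparameters reported above and in Table 4, the three classifiers might be instantiated as in the following sketch; the random_state value and C=1.0 are illustrative assumptions, since the paper does not state them:

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# RF: bagging ensemble of 300 trees, each limited to depth 60.
rf = RandomForestClassifier(n_estimators=300, max_depth=60, random_state=42)

# AC: boosting ensemble of 300 weak learners (decision stumps by default).
ac = AdaBoostClassifier(n_estimators=300, random_state=42)

# LR: liblinear solver with a one-vs-rest scheme, as described above.
lr = LogisticRegression(solver="liblinear", multi_class="ovr", C=1.0)

models = {"RF": rf, "AC": ac, "LR": lr}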

IV. METHODOLOGY
In this section, we formulate the problem and the assumptions and then describe the method and details of the techniques we used to solve the user classification problem.

A. PROBLEM STATEMENT
Assume that a company launches an app that helps its customers over the internet. Customers use this app and face some issues related to the app, which make them uncomfortable. Customers want the company to solve these issues. For this, the company gives its customers the option to give reviews about the services or about any issues related to the app. Many customers give their reviews about the company's products or services. The company then performs an analysis of these reviews to find the good things and the bad things and tries to determine whether the customer is happy or not, which is very helpful for its business strategy. In other words, the problem is to predict whether the customer is satisfied or not. As described in Section I, to solve this problem, we use text features of reviews and rating scores as input.

B. PROPOSED METHODOLOGY
This study uses different techniques to solve the classification problem, as shown in Figure 5. Figure 5 illustrates the steps for solving the user classification problem. First, the data goes through the preprocessing phase. As discussed in Section III, the dataset has ratings from 1 to 5 corresponding to each review. The study places the ratings into two classes, happy and unhappy. Ratings equal to or greater than 3 are assigned to the happy class, and ratings less than 3 are assigned to the unhappy class, as shown in Table 5.

FIGURE 5. Methodology diagram: green represents the data flow, while light blue represents the techniques and methods.

TABLE 5. Number of samples corresponding to each target class.

In the next step of the experiment, the review texts are cleaned by performing preprocessing: tokenization is applied to the reviews, and then numeric values are removed, since numeric digits usually have no influence on the meaning of the text. Next, letters are converted to lower case and punctuation is removed from the text reviews, because punctuation is not valuable for text analysis [12]. Sentences might be more readable because of punctuation, but it is difficult for a machine to differentiate punctuation from other characters; for that reason, punctuation was removed from the text during preprocessing. Then a stemming technique was applied to the reviews to get the root form of each word using the PorterStemmer library [41]. At the end of preprocessing, stopwords are removed from the text reviews because they create confusion in text analysis. Table 6 shows a sample data extract from the prepared dataset, and Table 7 shows the results after preprocessing the sample data.
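A minimal sketch of this preprocessing chain with NLTK, under the assumption that the punkt and stopwords data packages are installed, could look like this (the sample review is invented):

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Assumes nltk.download('punkt') and nltk.download('stopwords') have been run.
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(review):
    tokens = word_tokenize(review.lower())            # tokenize and lower-case
    tokens = [t for t in tokens
              if t not in string.punctuation          # remove punctuation
              and not t.isdigit()                     # remove numeric values
              and t not in stop_words]                # remove stopwords
    return [stemmer.stem(t) for t in tokens]          # stem to root form

print(preprocess("The app crashed 3 times, very disappointing!"))
# ['app', 'crash', 'time', 'disappoint']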


After preprocessing, we split the dataset into two subsets for training and testing. We divided the dataset in a ratio of 70% to 30%: 70% of the data was used for training and 30% for testing. Feature engineering techniques (see Section III(C)) were applied to both the training and testing sets to select and extract important features from the text reviews. The BoW and TF-IDF techniques, which are commonly used in text classification, where the frequency of each word is used as a feature for training a classifier [42], were used for feature extraction. Three reviews were used as sample data (Table 8) to apply the BoW and TF-IDF techniques, and the results are shown in Tables 9 and 10 below.

TABLE 6. Prepared dataset sample after conversion of ratings into target classes.

TABLE 7. Preprocessing of sample reviews.

TABLE 8. Sample of reviews.

TABLE 9. Results of BoW technique on preprocessed sample data.

TABLE 10. Results of TF-IDF technique on preprocessed sample data.

Table 9 shows the frequency of each feature in the sample data, and Table 10 shows the weight of each feature in the sample data. The Chi2 feature selection technique was applied to the results of BoW and TF-IDF to select important features from the data. After feature engineering, the machine learning models were trained using the important features extracted by the feature engineering techniques. The machine learning models were tuned with different hyperparameters, as mentioned in Table 4. After model training, the test data was passed to the trained models to evaluate the performance of the learning models.

C. EVALUATION CRITERIA
After all these steps, we come to the prediction phase. This study used several evaluation metrics: accuracy, f1 score, recall, and precision. These evaluation parameters are used to evaluate machine learning models [43]. This study also used confusion matrices to evaluate the performance of the algorithms; a confusion matrix is a table that is mostly used to describe the performance of a classifier on test data. It is also known as an error matrix and allows visualization of the performance of an algorithm.

1) ACCURACY
The accuracy score is used to measure prediction correctness for labels or target classes. This score's highest value is 1, and its lowest value is 0.

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \qquad (7)$$

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$

where
• True Positives (TP): the model predicted happy (the user is happy), and the actual value is also happy.
• True Negatives (TN): the model predicted unhappy (the user is not happy), and the actual value is also unhappy.
• False Positives (FP): the model predicted happy, but the actual value is unhappy (also known as a "Type I error").
• False Negatives (FN): the model predicted unhappy, but the actual value is happy (also known as a "Type II error").

2) RECALL
Recall measures the completeness of our classifiers. Recall is the number of true positives divided by the number of true positives plus the number of false negatives. The highest value is 1, and the lowest value is 0.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (9)$$

3) PRECISION
Precision measures the exactness of our classifiers. Precision is the number of true positives divided by the number of true positives plus false positives. The highest value is 1, and the lowest value is 0.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (10)$$


4) F1 SCORE
The f1 score conveys the balance between precision and recall; in other words, the f1 score is the harmonic mean of precision and recall. Like the other scores, f1 ranges from 0 to 1.

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$

This study used the parameters mentioned above to evaluate the performance of all algorithms. We compare algorithm accuracies and propose the best performer on the Shopify data.
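A small sketch of equations (7) to (11) with scikit-learn follows; the two label vectors are invented for illustration:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # 1 = happy, 0 = unhappy
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                  # 4 2 1 1
print(accuracy_score(y_true, y_pred))  # (TP+TN)/(TP+TN+FP+FN) = 6/8 = 0.75
print(precision_score(y_true, y_pred)) # TP/(TP+FP) = 4/5 = 0.8
print(recall_score(y_true, y_pred))    # TP/(TP+FN) = 4/5 = 0.8
print(f1_score(y_true, y_pred))        # 2PR/(P+R) = 0.8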

V. RESULTS AND DISCUSSION
This section discusses the results of the experiments for solving the classification problem. This study used different techniques, especially for feature selection; the Chi2 technique was used, which gave an improvement in our experimental results. We compare the results of two tree-based ensemble algorithms, RF and AC, with a statistical algorithm, LR. As mentioned above, four popular evaluation scores were used to compare the machine learning algorithms (see Section IV(C)). This study set up different experiments to evaluate the learning models by using feature extraction techniques on the Shopify app dataset. In the first approach, we applied both feature extraction techniques (BoW, TF-IDF) with all machine learning algorithms to classify happy and unhappy users. The results are shown in Tables 11 and 12, where bold font indicates the highest scores.

TABLE 11. Machine learning models results with BoW.

TABLE 12. Machine learning models results with TF-IDF.

TABLE 13. Machine learning models results with (BoW + Chi2).

TABLE 14. Machine learning models results with (TF-IDF + Chi2).

FIGURE 6. Confusion matrices of RF, LR, and AC using the BoW and TF-IDF techniques. Here 0 represents the unhappy class, whereas 1 represents the happy class.

The accuracy scores of the learning models show that LR performs very well on the test dataset. LR gave high results in terms of f1 score with respect to both feature extraction techniques (BoW and TF-IDF). RF achieves the highest recall score with both feature extraction techniques, while AC also achieves acceptable results. According to the confusion matrix, LR with TF-IDF achieved the best result and also achieved the highest TP and TN rates. LR gives 3098 (1946 happy & 1152 unhappy) correct predictions and 723 (306 happy & 417 unhappy) wrong predictions against 3821 (2252 happy & 1569 unhappy) test examples with TF-IDF, which is higher than the others, as shown in the confusion matrix in Figure 6. The average accuracy for the happy class is 86%, and for the unhappy class it is up to 73%. By changing the feature engineering techniques to (BoW + Chi2) and (TF-IDF + Chi2), the learning models improved their results. BoW was used to extract the features from the text, and Chi2 was used to select the important features from the extracted features. These techniques also gave us better results with LR; the results in Tables 13 and 14 show that LR also performs well with the (BoW + Chi2) and (TF-IDF + Chi2) feature engineering techniques. Again, bold font indicates the highest score values.

Results with (BoW + Chi2) are lower than with (TF-IDF + Chi2) because the combination of TF-IDF and Chi2 gives more valuable features to the learning models for fitting. LR gives more accurate predictions for happy and unhappy users with both feature engineering techniques. According to the confusion matrix (Figure 7), LR with the combination of TF-IDF and Chi2 gives 3186 (1992 happy & 1194 unhappy) correct predictions and 635 (260 happy & 375 unhappy) wrong predictions out of 3821 (2252 happy & 1569 unhappy) predictions, with an average accuracy for the happy class of 88% and for the unhappy class of up to 76%. These results clearly show the increase in correct predictions and the decrease in wrong predictions by the learning models when we use a combination of features.

In this study, performance was measured by accuracy, precision, recall, and f1 score for all approaches, but we specifically use the f1 score to compare the performance of the machine learning models because this score is the one most used for measuring the performance of supervised machine learning algorithms [44]. LR tops the table by achieving the highest scores, as illustrated in Figures 8, 9, 10, 11, and 12.

FIGURE 7. Confusion matrices of RF, LR, and AC using the (BoW + Chi2) and (TF-IDF + Chi2) techniques. Here 0 represents the unhappy class, whereas 1 represents the happy class.

FIGURE 8. Comparison between accuracy, precision, recall, and f1 score of all learning models using the BoW technique.

FIGURE 9. Comparison between accuracy, precision, recall, and f1 score of all learning models using the TF-IDF technique.

FIGURE 10. Comparison between accuracy, precision, recall, and f1 score of all learning models using the (BoW + Chi2) technique.

FIGURE 11. Comparison between accuracy, precision, recall, and f1 score of all learning models using the (TF-IDF + Chi2) technique.

FIGURE 12. Comparison between f1 scores of machine learning models using all feature engineering techniques.

According to our comparisons, LR performed very well with all feature engineering techniques and beat the ensemble learning models during the classification of users. As mentioned above, this study had binary target classes (happy and unhappy users), and for this situation and problem, LR performs better than RF and AC because LR performs well when the prediction probability ranges from 0 to 1. In other words, LR performs well in binary classification problems with its "multi_class=ovr" parameter, as in our binary classification problem. AC also achieved acceptable results on text data because of its boosting method. The boosting method produces highly accurate predictions by combining many weak learners [45], but it can lead toward overfitting the classifiers [31]. RF in our experiments achieved the highest recall score because it is an ensemble learner in which multiple decision trees often perform better than a single decision tree in classification tasks [9], [46]; that is the reason we used ensemble models in our experiments. Experimental results also show that feature engineering techniques are effective in machine learning and helpful in improving the accuracy of learning models. All learning models achieve their desired results because the feature selection technique gives us important features from the extracted features, which is an effective way to increase the accuracy of learning models.

The performance of the learning models without the feature selection technique (Chi2) and with simple feature extraction techniques, such as BoW and TF-IDF, is also acceptable. As a comparison between learning models, LR wins the race in all situations for solving the user classification problem. One can deploy the Wilcoxon test to further validate that the performance of our proposed model is statistically significant.

VI. CONCLUSION
This study, for the very first time to the best of our knowledge, exploits the use of different machine learning approaches to solve the user review classification problem on the basis of different feature engineering techniques, such as BoW, TF-IDF, and Chi2. The classifiers (RF, LR, and AC) were trained on text reviews to predict a user's review as happy or unhappy for Shopify apps only. The comparative analysis reveals that LR outperformed the others when TF-IDF and Chi2 were used together. We end the conclusion by pointing out that the results and conclusions of our experiments are based on a single dataset (the Shopify app dataset), which was never used before for classification purposes, and these algorithms have not yet been tested on other datasets. It is possible that our results are specific to the dataset being used. Our future work entails testing deep machine learning models on different text and categorical datasets for the purpose of user review classification.

REFERENCES
[1] S. R. Das and M. Y. Chen, "Yahoo! for Amazon: Sentiment extraction from small talk on the Web," Manage. Sci., vol. 53, no. 9, pp. 1375–1388, Sep. 2007.
[2] J. Chevalier and D. Mayzlin, "The effect of word of mouth on sales: Online book reviews," Nat. Bureau Economic Res., Cambridge, MA, USA, Tech. Rep. w10148, 2003.
[3] E. Guzman and W. Maalej, "How do users like this feature? A fine grained sentiment analysis of app reviews," in Proc. IEEE 22nd Int. Requirements Eng. Conf. (RE), Aug. 2014, pp. 153–162.
[4] C. Iacob and R. Harrison, "Retrieving and analyzing mobile apps feature requests from online reviews," in Proc. 10th Work. Conf. Mining Softw. Repositories (MSR), May 2013, pp. 41–44.
[5] W. Maalej and H. Nabil, "Bug report, feature request, or simply praise? On automatically classifying app reviews," in Proc. IEEE 23rd Int. Requirements Eng. Conf. (RE), Aug. 2015, pp. 116–125.
[6] T. Pranckevičius and V. Marcinkevičius, "Comparison of naive Bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification," Baltic J. Mod. Comput., vol. 5, no. 2, pp. 221–232, 2017.
[7] E. Loper and S. Bird, "NLTK: The natural language toolkit," 2002, arXiv:cs/0205028. [Online]. Available: https://arxiv.org/abs/cs/0205028
[8] J.-H. Xue and D. M. Titterington, "Comment on 'On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,'" Neural Process. Lett., vol. 28, no. 3, pp. 169–187, Dec. 2008.
[9] E. Guzman, M. El-Haliby, and B. Bruegge, "Ensemble methods for app review classification: An approach for software evolution (N)," in Proc. 30th IEEE/ACM Int. Conf. Automat. Softw. Eng. (ASE), Nov. 2015, pp. 771–776.
[10] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, and H. C. Gall, "How can I improve my app? Classifying user reviews for software maintenance and evolution," in Proc. IEEE Int. Conf. Softw. Maintenance Evol. (ICSME), Sep. 2015, pp. 281–290.
[11] S. M. Isa, R. Suwandi, and Y. P. Andrean, Optimizing the Hyperparameter of Feature Extraction and Machine Learning Classification Algorithms. London, U.K.: The Science and Information Organization, 2019.
[12] F. Rustam, I. Ashraf, A. Mehmood, S. Ullah, and G. Choi, "Tweets classification on the base of sentiments for US airline companies," Entropy, vol. 21, no. 11, p. 1078, Nov. 2019.
[13] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston, "Random forest: A classification and regression tool for compound classification and QSAR modeling," J. Chem. Inf. Comput. Sci., vol. 43, no. 6, pp. 1947–1958, Nov. 2003.
[14] F. F. Bocca and L. H. A. Rodrigues, "The effect of tuning, feature engineering, and feature selection in data mining applied to rainfed sugarcane yield modelling," Comput. Electron. Agricult., vol. 128, pp. 67–76, Oct. 2016.
[15] J. Heaton, "An empirical analysis of feature engineering for predictive modeling," in Proc. SoutheastCon, Mar. 2016, pp. 1–6.
[16] S. C. Eshan and M. S. Hasan, "An application of machine learning to detect abusive Bengali text," in Proc. 20th Int. Conf. Comput. Inf. Technol. (ICCIT), Dec. 2017, pp. 1–6.
[17] X. Hu, J. S. Downie, and A. F. Ehmann, "Lyric text mining in music mood classification," Amer. Music, vol. 183, no. 5,049, pp. 2–209, 2009.
[18] B. Yu, "An evaluation of text classification methods for literary study," Literary Linguistic Comput., vol. 23, no. 3, pp. 327–343, Sep. 2008.
[19] S. Robertson, "Understanding inverse document frequency: On theoretical arguments for IDF," J. Document., vol. 60, no. 5, pp. 503–520, Oct. 2004.
[20] W. Zhang, T. Yoshida, and X. Tang, "A comparative study of TF*IDF, LSI and multi-words for text classification," Expert Syst. Appl., vol. 38, no. 3, pp. 2758–2765, Mar. 2011.
[21] P. Meesad, P. Boonrawd, and V. Nuipian, "A chi-square-test for word importance differentiation in text classification," in Proc. Int. Conf. Inf. Electron. Eng., 2011, pp. 110–114.
[22] S. Bird, "NLTK-Lite: Efficient scripting for natural language processing," in Proc. 4th Int. Conf. Natural Lang. Process. (ICON), 2005, pp. 11–18.
[23] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, "Supervised machine learning: A review of classification techniques," Emerg. Artif. Intell. Appl. Comput. Eng., vol. 160, pp. 3–24, 2007.
[24] G. Biau and E. Scornet, "A random forest guided tour," TEST, vol. 25, no. 2, pp. 197–227, Jun. 2016.
[25] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, no. 2, pp. 123–140, Aug. 1996.
[26] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[27] J. Benediktsson and P. Swain, "Consensus theoretic classification methods," IEEE Trans. Syst., Man, Cybern., vol. 22, no. 4, pp. 688–704, 1992.
[28] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 10, pp. 993–1001, Oct. 1990.
[29] L. I. Kuncheva, "That elusive diversity in classifier ensembles," in Proc. Iberian Conf. Pattern Recognit. Image Anal. Berlin, Germany: Springer, 2003, pp. 1126–1138.


[30] R. E. Schapire, "A brief introduction to boosting," in Proc. IJCAI, vol. 99, 1999, pp. 1401–1406.
[31] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997.
[32] D. S. Palmer, N. M. O'Boyle, R. C. Glen, and J. B. O. Mitchell, "Random forest models to predict aqueous solubility," J. Chem. Inf. Model., vol. 47, no. 1, pp. 150–158, Jan. 2007.
[33] Y. Zhang, H. Zhang, J. Cai, and B. Yang, "A weighted voting classifier based on differential evolution," Abstract Appl. Anal., vol. 2014, 2014.
[34] J. Belanich and L. E. Ortiz, "On the convergence properties of optimal AdaBoost," 2012, arXiv:1212.1108. [Online]. Available: https://arxiv.org/abs/1212.1108
[35] I. Sevilla-Noarbe and P. Etayo-Sotos, "Effect of training characteristics on object classification: An application using boosted decision trees," Astron. Comput., vol. 11, pp. 64–72, Jun. 2015.
[36] F. Elorrieta, S. Eyheramendy, A. Jordán, I. Dékány, M. Catelan, R. Angeloni, J. Alonso-García, R. Contreras-Ramos, F. Gran, and G. Hajdu, "A machine learned classifier for RR Lyrae in the VVV survey," Astron. Astrophys., vol. 595, p. A82, Nov. 2016.
[37] R. Zitlau, B. Hoyle, K. Paech, J. Weller, M. M. Rau, and S. Seitz, "Stacking for machine learning redshifts applied to SDSS galaxies," Monthly Notices Roy. Astronomical Soc., vol. 460, no. 3, pp. 3152–3162, 2016.
[38] A. Mayr, H. Binder, O. Gefeller, and M. Schmid, "The evolution of boosting algorithms," Methods Inf. Med., vol. 53, no. 6, pp. 419–427, 2014.
[39] G. Grégoire, "Multiple linear regression," in European Astronomical Society, vol. 66. Les Ulis, France: EDP Sciences, 2014, pp. 45–72.
[40] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surv., vol. 34, no. 1, pp. 1–47, 2002.
[41] B. Issac and W. J. Jap, "Implementing spam detection using Bayesian and Porter stemmer keyword stripping approaches," in Proc. IEEE Region Conf. (TENCON), Jan. 2009, pp. 1–5.
[42] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," 2016, arXiv:1607.01759. [Online]. Available: https://arxiv.org/abs/1607.01759
[43] Y. J. Huang, R. Powers, and G. T. Montelione, "Protein NMR recall, precision, and F-measure scores (RPF scores): Structure quality assessment measures based on information retrieval statistics," J. Amer. Chem. Soc., vol. 127, no. 6, pp. 1665–1674, 2005.
[44] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," 2016, arXiv:1606.05250. [Online]. Available: https://arxiv.org/abs/1606.05250
[45] S. Mathanker, P. Weckler, T. Bowser, N. Wang, and N. Maness, "AdaBoost classifiers for pecan defect classification," Comput. Electron. Agricult., vol. 77, no. 1, pp. 60–68, 2011.
[46] A. C. Tan and D. Gilbert, "Ensemble machine learning on gene expression data for cancer classification," Appl. Bioinf., vol. 2, no. 3, pp. S75–S83, 2003.

FURQAN RUSTAM received the M.C.S. degree from the Department of Computer Science, Islamia University of Bahawalpur, Pakistan, in October 2017. He is currently pursuing the master's degree in computer science with the Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan. He is also serving as a Research Assistant with the Fareed Computing and Research Center, KFUEIT. His recent research interests are related to data mining, mainly machine learning and deep learning-based IoT and text mining tasks.

ARIF MEHMOOD received the Ph.D. degree from the Department of Information and Communication Engineering, Yeungnam University, South Korea, in November 2017. He is currently working as an Assistant Professor with the Department of Computer Science and IT, The Islamia University of Bahawalpur, Pakistan. His recent research interests are related to data mining, mainly AI and deep learning-based text mining, and data science management technologies.

MUHAMMAD AHMAD is currently an Assistant Professor with the Department of Computer Engineering, Khwaja Fareed University of Engineering and Information Technology, Pakistan. He is also associated with the Research Group, Advanced Image Processing Research Lab (AIPRL), the First Hyperspectral Imaging Lab, Pakistan, and with the University of Messina, Messina, Italy, as a Research Fellow. He has authored a number of research articles in reputed journals and conferences. He is a Regular Reviewer for Springer Nature journals, NCAA, the IEEE (TIE, TNNLS, TGRS, TIP, GRSL, GRSM, JSTARS, TMC, the TRANSACTIONS ON MULTIMEDIA, ACCESS, COMPUTERS, SENSORS, and the TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING), MDPI journals, Optik, Measurement Science and Technology, IET journals, and the Transactions on Internet and Information Systems. His current research interests include machine learning, computer vision, remote sensing, hyperspectral imaging, and wearable computing.

SALEEM ULLAH was born in Ahmedpur East, Pakistan, in 1983. He received the B.Sc. degree in computer science from Islamia University Bahawalpur, Pakistan, in 2003, the M.I.T. degree in computer science from Bahauddin Zakariya University, Multan, in 2005, and the Ph.D. degree from Chongqing University, China, in 2012. From 2006 to 2009, he worked as a Network/IT Administrator in different companies. From August 2012 to February 2016, he worked as an Assistant Professor with Islamia University Bahawalpur. He has been working as an Associate Professor with the Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, since February 2016. He has almost 13 years of industry experience in the field of IT. He is also an active researcher in the fields of ad hoc networks, congestion control, and security.

DOST MUHAMMAD KHAN received the M.Sc. degree (Hons.) in computer science from Bahauddin Zakariya University (BZU), Multan, in 1990, and the Ph.D. degree from the School of Innovative Technologies and Engineering (SITE), University of Technology, Mauritius (UTM), in 2013. He is currently working as an Assistant Professor with the Department of Computer Science and IT, The Islamia University of Bahawalpur, Pakistan. His areas of research are data mining and data mining techniques, multiagent systems (MAS), object-oriented database systems, and formal methods in software engineering.

GYU SANG CHOI received the Ph.D. degree from the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA, in 2005. He was a Research Staff Member with the Samsung Advanced Institute of Technology (SAIT) for Samsung Electronics, from 2006 to 2009. Since 2009, he has been a Faculty Member with the Department of Information and Communication, Yeungnam University, South Korea. His research areas include non-volatile memory and storage systems.
