Zuo - Sentiment Analysis of Steam Review Datasets Using Naive Bayes and Decision Tree Classifier (2018)
Fig. 2: Interactive dashboard of daily players for Dota 2 over the past three years

2) Data Scraping: On GitHub, several scrapers have been developed to scrape review data directly from the official website. Most of these scrapers were built with Scrapy in Python, and both product information and user review information can be scraped. The advantage of this method is that the reviews are the most complete and up to date. The disadvantage is that it is very time consuming. Errors may occur if the scraping speed or the number of scrapers exceeds the maximum limits, but those limits are not clearly documented. The scraping proceeds in iterations, with 20 reviews retrieved per iteration, and a single error in one iteration can cause all subsequent iterations to fail. Moreover, it may take a few days to scrape all reviews from Steam with this method.

3) API: This is the method that was finally chosen. The first step is to get the IDs and names of all products and save the IDs to a list. The second step is to pull the product information through a product API for each ID obtained in step 1. The files returned by the API are JSON files; each is saved after being transformed into a CSV data table. The next step is to use the IDs again with a review API to fetch all the reviews; each product generates one JSON file. By merging all the reviews together, a single CSV file is retained for analysis. The reviews obtained from the API are not real-time, but they are updated regularly. Additionally, user information can be downloaded with a user information API, which requires an API key. That data contains many missing values, so they are removed from the final dataset.
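As a rough sketch of this two-step workflow, the snippet below pulls the app ID list and then pages through reviews for one product. The paper does not name the endpoints it used; the public GetAppList and appreviews endpoints here are assumptions, and error handling, rate limiting, and the user-information API key are omitted.

    # Minimal sketch of the ID-then-reviews workflow described above.
    # The endpoints are assumptions; the paper does not name them.
    import requests

    def get_app_ids():
        """Step 1: fetch all product IDs and names, keep the IDs in a list."""
        url = "https://api.steampowered.com/ISteamApps/GetAppList/v2/"
        apps = requests.get(url).json()["applist"]["apps"]
        return [app["appid"] for app in apps]

    def get_reviews(app_id, pages=5):
        """Step 3: pull review pages for one product via the review API."""
        reviews = []
        cursor = "*"
        for _ in range(pages):
            resp = requests.get(
                f"https://store.steampowered.com/appreviews/{app_id}",
                params={"json": 1, "num_per_page": 100, "cursor": cursor},
            ).json()
            reviews.extend(resp.get("reviews", []))
            cursor = resp.get("cursor", cursor)  # paginate with the returned cursor
        return reviews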
B. Data Processing

1) Remove special characters and digits: Punctuation was removed first. Even though the majority of reviews are in English, a small portion are in Russian, Chinese, and other languages. Special characters and digits were removed by keeping only upper- and lower-case English letters.

2) Lower case: After the previous step, all text other than letters has been removed. All letters appearing in the reviews are then transformed to lower case so as to reduce the vocabulary size.

3) Remove stop words: The stop word list comes from the NLTK package in Python. Other stop words are selected manually based on the word cloud plot. For example, "get" is not in the original NLTK list, but it appears many times in the text and can safely be ignored, so it is treated as a stop word.

4) Stemming: Stemming was used to reduce the vocabulary size. The stemmer used is the Porter Stemmer in NLTK.

5) Remove links: Links are removed because they may not include sentiment information. They are removed with a regular expression: if a string matches the form "http://", it is removed.

6) Remove the most frequent words: A threshold N is defined such that a word appearing in more than N reviews is removed. Words that appear too frequently may contain less information for sentiment analysis. The threshold N is chosen with cross validation.

7) Remove the most infrequent words: A threshold N is defined such that a word appearing in fewer than N reviews is removed. Words that appear too infrequently may also contain less information for sentiment analysis; most of the time they are misspelled or created by a user, such as "Yaaaaaaaaaaaaaaaaaaaaaa". The threshold N is chosen with cross validation.

8) Correct misspelled words: Since there are some misspelled words, methods were attempted to correct them. Python packages such as Hunspell and autocorrect were tried, and packages such as PyEnchant can also check whether a word exists in English. However, there are three reasons why those methods may not be very effective: (1) All of them are time-consuming. The basic idea of Hunspell and autocorrect is to compute the distance between a possibly misspelled word and the correct words in a dictionary using some distance function, then output the most similar one. This takes a long time for a large amount of text, since every word must be processed, and it may still not reduce the vocabulary much compared with lower-casing and stemming. (2) Automatic correction tools are now almost everywhere; most input method editors (IMEs) underline misspelled words with a red line. As a result, the majority of the remaining misspelled words may be misspelled on purpose or, in a more complex case, may already have been automatically corrected to a different valid word by the input method editor. (3) None of those methods consider context, and a correction that ignores context is not accurate. This step is therefore skipped.

9) Remove short reviews: After all of the steps above, some reviews have only a few words. They are too short and may bring noise to the classifier. A threshold N is defined such that once the number of words in a cleaned review is less than N, that review is removed.
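A minimal sketch of steps 1-5 and 9 follows, assuming NLTK is installed and its stopwords corpus downloaded (nltk.download("stopwords")); the length threshold and the manual stop word additions are placeholders for the cross-validated and word-cloud-derived choices described above.

    # Rough sketch of cleaning steps 1-5 and 9; thresholds are illustrative.
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP_WORDS = set(stopwords.words("english")) | {"get"}  # manual additions
    STEMMER = PorterStemmer()
    MIN_WORDS = 3  # placeholder for the cross-validated threshold N

    def clean_review(text):
        text = re.sub(r"http://\S+", " ", text)              # 5) remove links
        text = re.sub(r"[^a-zA-Z]+", " ", text)              # 1) keep English letters only
        tokens = text.lower().split()                        # 2) lower case
        tokens = [t for t in tokens if t not in STOP_WORDS]  # 3) remove stop words
        tokens = [STEMMER.stem(t) for t in tokens]           # 4) Porter stemming
        return tokens if len(tokens) >= MIN_WORDS else None  # 9) drop short reviews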
After processing, the data is converted to a sparse two-dimensional matrix. The number of rows equals the number of reviews, and the number of columns equals the number of unique words. Each element represents the number of appearances of a word in a review.
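Such a sparse reviews-by-words count matrix can be produced, for example, with scikit-learn's CountVectorizer; the paper does not state which vectorizer implementation was used, so this is only one plausible realization.

    # One plausible way to build the sparse reviews-by-words count matrix.
    from sklearn.feature_extraction.text import CountVectorizer

    cleaned = ["great game love it", "total wast of money", "fun with friend"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(cleaned)   # scipy CSR matrix: reviews x unique words
    print(X.shape)                          # (3, number of unique words)
    print(vectorizer.get_feature_names_out())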
C. Feature Selection and Vectorizer

According to Forman [6], feature selection is essential for building an efficient and more accurate classifier. Sharma in 2012 [8] investigated the performance of different feature selection methods for sentiment analysis and found that information gain gives consistent results while gain ratio performs best overall. That result was obtained on a movie review dataset with 2000 documents; the experiment will be repeated here for the Steam dataset.

1) N-gram: N-grams are a way to generate features. Dave [7] found that in some settings bigrams and trigrams perform better than unigrams. For bigrams, every pair of adjacent words (w1, w2) in the document is selected as a feature: a document is associated with all the bigrams it contains, and the feature value is the number of times the bigram occurs in the document. Bigram features also include the unigram features, and likewise trigram features include both bigram and unigram features. The performance of unigrams, bigrams, and trigrams is compared with cross validation.
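Assuming a scikit-learn implementation, this cumulative scheme corresponds to ngram_range=(1, n), which keeps the lower-order grams alongside the higher-order ones:

    # Unigrams-only vs. up to bigrams vs. up to trigrams, as candidate
    # settings for the cross-validated comparison described above.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["really fun game", "really bad port"]
    for n in (1, 2, 3):
        vec = CountVectorizer(ngram_range=(1, n))
        vec.fit(docs)
        print(n, vec.get_feature_names_out())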
D. Popularity score

The numbers of positive, negative, and neutral words can be stored in three extra columns. However, they are removed from the final model because the polarity scores from the SentimentIntensityAnalyzer in Python's NLTK package are not accurate for stemmed words. For example, awesome is a positive word with a polarity score greater than 0.5, but the stemmed version of awesome, which is awesom, is scored as neutral. Also, like and likes receive different polarity scores. So the popularity score was not used.
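The stemming problem can be reproduced directly with NLTK's VADER analyzer (assuming the vader_lexicon resource is downloaded); the exact scores depend on the lexicon version, but awesom is absent from the lexicon and therefore scores as neutral.

    # Illustrates why polarity counts were dropped: stemming breaks lexicon lookup.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("awesome"))  # positive compound score (> 0.5)
    print(sia.polarity_scores("awesom"))   # not in the lexicon -> neutral (all zeros)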
E. Information Gain

Information Gain (IG) measures the information gained (in bits) for classifying a text document by evaluating the presence or absence of a feature in the document. A useful feature should decrease the entropy more than a useless one. For a binary classification problem, the entropy of a partition D is given by

    Info(D) = -\sum_{i=1}^{2} P_i \log_2(P_i)

where P_i is the proportion of instances of each class or category. For example, P_1 represents the proportion of positive reviews and P_2 the proportion of negative reviews.

To classify the documents in D on some feature with attributes A = {a_1, ..., a_v}, the whole document set D is split into v partitions {D_1, D_2, ..., D_v}. The memory needed to store those smaller partitions is the entropy after splitting:

    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

where |D_j| is the number of documents in D_j and Info(D_j) is calculated in the same way as the formula above. The information gain of the feature is then the difference Gain(A) = Info(D) - Info_A(D).

The formula for IG is simple, but it can take a long time to compute for a large dataset, since the IG for each word must be computed separately. For this dataset, the word vectorization is stored in a sparse matrix in Python and the labels in another dataframe, and counting the instances of each label by summarizing the columns of a sparse matrix one by one is time-consuming. To simplify the computation, the word vectorization is transformed into a binary matrix rather than counts of appearances: each element is a binary variable indicating the presence of the word in the review, so v in the formula above always equals 2. Also, only one hundred thousand reviews are randomly selected for the IG part. The number of features used for the final model is selected with cross validation.
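A sketch of this binary-presence IG computation follows, assuming x is a dense 0/1 feature column and y an array of 0/1 labels; the function names are mine, not the paper's.

    # Information gain of one binary presence feature against binary labels.
    import numpy as np

    def entropy(p):
        """Entropy of class proportions p (array summing to 1)."""
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def information_gain(x, y):
        """Gain(A) = Info(D) - Info_A(D) for a 0/1 feature x and labels y."""
        n = len(y)
        base = entropy(np.bincount(y, minlength=2) / n)
        info_a = 0.0
        for v in (0, 1):                      # the two partitions D_1, D_2
            mask = x == v
            if mask.sum() == 0:
                continue
            p = np.bincount(y[mask], minlength=2) / mask.sum()
            info_a += mask.sum() / n * entropy(p)
        return base - info_a

For the sparse matrix, a column can be densified with X[:, j].toarray().ravel(), after binarizing the counts with (X > 0).astype(int).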
F. TF-IDF

TF-IDF is short for term frequency-inverse document frequency. It is a vectorizer, and TFIDF-transformed data can be used directly by the classifier. The term frequency is calculated for each term in a review as

    TF(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}

where t is the term and d is the document. The inverse document frequency is calculated by

    IDF(t, D) = \log\left(\frac{\text{total number of documents } |D|}{\text{number of documents with term } t \text{ in them}}\right)

After calculating TF and IDF with the two formulas above, the TFIDF is calculated by

    TFIDF(t, d, D) = TF(t, d) \times IDF(t, D)

TFIDF can also be used as a feature selection method, selecting the top n features with the largest weights, where the weight of each term is defined as the average of its TFIDF over all documents. The number of features selected for the final model is chosen by cross validation.

Improved versions of TFIDF have been developed to improve performance. In Wang 2010 [10], distribution information among classes and inside a class is used to develop the weighting function. Moreover, Martineau in 2011 proposed Delta TFIDF [11], and the accuracy dramatically improved with an SVM classifier.
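As one concrete realization (the paper does not name its implementation), scikit-learn's TfidfVectorizer computes a close variant of these formulas; note that its default IDF is smoothed, log((1+|D|)/(1+df)) + 1, rather than the plain logarithm above.

    # Plausible TF-IDF vectorization plus "top-n by mean weight" selection,
    # mirroring the feature selection rule described above.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["great game", "bad game", "waste of money"]
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    n = 2
    mean_weights = np.asarray(X.mean(axis=0)).ravel()  # average weight per term
    top = np.argsort(mean_weights)[::-1][:n]           # indices of top-n terms
    print(vec.get_feature_names_out()[top])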
III. MODEL

A grid search is performed to find the best combination of hyperparameters. The parameters that can be tuned include the minimum review length, the maximum and minimum numbers of reviews a word may appear in, the n-gram range, the number of features, the feature selection method, whether to include additional features, and the model. The parameter space is listed in TABLE I.
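A hedged sketch of such a search is given below, using a scikit-learn Pipeline; the grid values are placeholders rather than the paper's TABLE I, and only the Decision Tree branch is shown (Gaussian Naive Bayes would additionally need dense input).

    # Sketch of the hyperparameter grid search over vectorizer and model settings.
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    pipe = Pipeline([
        ("vec", CountVectorizer()),
        ("clf", DecisionTreeClassifier()),
    ])
    grid = {
        "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],  # gram range
        "vec__max_df": [0.5, 0.9],                     # max reviews a word appears in
        "vec__min_df": [2, 5],                         # min reviews a word appears in
        "vec__max_features": [10_000, 50_000],         # number of features
    }
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    # search.fit(reviews, labels)  # reviews: cleaned strings, labels: 0/1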
IV. RESULT

In the gathered data, there are 7,705,997 reviews collected from 22,548 games. Among those reviews, 6,410,832 are positive and 1,295,165 are negative. The data was balanced by keeping all the negative reviews and randomly subsampling the same number of positive reviews. After gathering and cleaning the data, a word cloud is shown in Figure 5.
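The balancing step amounts to a simple random undersample of the majority class; a sketch follows, where the DataFrame layout and the 'positive' column name are assumptions.

    # Balance by keeping all negatives and subsampling an equal number of positives.
    import pandas as pd

    def balance(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
        neg = df[~df["positive"]]
        pos = df[df["positive"]].sample(n=len(neg), random_state=seed)
        return pd.concat([neg, pos]).sample(frac=1, random_state=seed)  # shuffle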
Fig. 4: How many positive and negative reviews does each game have?

The red parts of the histogram in Figure 4 are the numbers of positive reviews and the blue parts are the numbers of negative reviews. Most games have fewer than 20 reviews; on the other hand, the top five most popular games received about 30% of all reviews.

Fig. 6: Word cloud for all words in positive reviews

Figure 6 shows the word cloud for all words appearing in positive reviews. The word mean should be a negative word, but it appears many times in the positive reviews. Other than mean, most words are positive or neutral.
Figure 10 compares the accuracy of the Decision Tree model and the Gaussian Naive Bayes model. The Decision Tree model performs much better overall than the Gaussian Naive Bayes model. One possible explanation is that the feature selection method is TFIDF; the results should be different if the feature selection method is IG.
APPENDIX

Figure 13 shows the features selected by the best model. Some terms have very strong sentiment, such as waste, highly, fine, and really. The terms selected by TFIDF do make sense.

Fig. 14: What terms are selected by IG?

Figure 14 shows the words selected by IG. They may not agree with the words selected by TFIDF because different datasets are used for the two methods. Also, for IG, only unigrams are used, without bigrams and trigrams.

REFERENCES

[1] Fang, X., & Zhan, J. (2015). Sentiment analysis using product review data. Journal of Big Data, 2(1). doi:10.1186/s40537-015-0015-2
[2] Utz, S., Kerkhof, P., & Bos, J. V. (2012). Consumers rule: How consumer reviews influence perceived trustworthiness of online stores. Electronic Commerce Research and Applications, 11(1), 49-58. doi:10.1016/j.elerap.2011.07.010
[3] Tan, L. K., Na, J., Theng, Y., & Chang, K. (2011). Sentence-level sentiment polarity classification using a linguistic approach. Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, Lecture Notes in Computer Science, 77-87. doi:10.1007/978-3-642-24826-9_13
[4] Krikorian, R. (VP, Platform Engineering, Twitter Inc.). "New Tweets per second record, and how!" Twitter Official Blog, August 16, 2013.
[5] Sobkowicz & Stokowiec (2016). Steam Review Dataset - new, large scale sentiment dataset.
[6] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, pp. 1289-1305.
[7] Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th International Conference on World Wide Web, WWW '03, pages 519-528, New York, NY, USA. ACM.
[8] Sharma & Dey (2012). Performance investigation of feature selection methods and sentiment lexicons for sentiment analysis. Special Issue of International Journal of Computer Applications, June 2012.
[9] Li, P., & Huang, H. (2002). Improved feature selection approach TFIDF in text mining. IEEE Conference Publication. Retrieved from http://ieeexplore.ieee.org/document/1174522/
[10] Wang, N., Wang, P., & Zhang, B. (2010). An improved TF-IDF weights function based on information theory. 2010 International Conference on Computer and Communication Technologies in Agriculture Engineering. doi:10.1109/cctae.2010.5544382
[11] Martineau, J., & Finin, T. (2011). Delta TFIDF: An improved feature space for sentiment analysis. Association for the Advancement of Artificial Intelligence.