Sutabri 2018

This conference paper discusses the enhancement of the Naïve Bayes method for sentiment analysis specifically in the hotel industry in Indonesia, utilizing reviews from the e-traveling site traveloka.com. The study aims to improve the accuracy of sentiment ratings by analyzing user testimonials, achieving a success rate of 89% in sentiment classification. The methodology includes data scraping, preprocessing, and applying a probabilistic model to better align ratings with user sentiments.
Improving Naïve Bayes in Sentiment Analysis For Hotel Industry in Indonesia

Conference Paper · October 2018


DOI: 10.1109/IAC.2018.8780444

All content following this page was uploaded by Edi Surya Negara on 07 March 2020.



Improving Naïve Bayes in Sentiment Analysis
For Hotel Industry in Indonesia
Tata Sutabri, Information Technology Faculty, University of Respati Indonesia, Jakarta, Indonesia
Agung Suryatno, Technical Information, STMIK ERESHA, Banten, Indonesia
Dedi Setiadi, Faculty of Computer, MH. Thamrin University, Jakarta, Indonesia
Edi Surya Negara, Computer Science, Universitas Bina Darma, Palembang, Indonesia

Abstract—In online ordering, buyers of services often have difficulty determining which service best matches their needs. The ratings shown by some marketplaces are sometimes inconsistent with the content of the reviews written by users, which reduces users' trust in those ratings. This study therefore attempts a comprehensive analysis that reads and analyzes every review related to a given service. The burden for users is the large number of reviews and the very different language styles used in them. This study proposes a method that provides a rating more in line with the sentiment expressed in the review. The method was developed using a corpus built on a topic model for hotel management sites. Sentiment analysis was performed using the Naïve Bayes method together with probabilistic values from the corpus. Test results showed that the method analyzed sentiment with a success rate of 89%. The results of the sentiment analysis are used as the basis for calculating the rating.

Keywords—sentiment analysis, corpus, naïve Bayes, topic model, hotel review.

I. INTRODUCTION

The development of online media has had a positive impact through the emergence of virtually unlimited textual information, which creates the need to represent that information without reducing its value. Textual information is divided into two kinds, facts and opinions. Facts are objective expressions about an entity or an event, whereas opinions are subjective expressions of a person's sentiments about an entity or event (Y. Nur & D. Santika, 2011). A large amount of information takes the form of user testimonials for items ranging from computer products, smartphones, holiday services, and hotel services to movie reviews. This valuable source of knowledge greatly helps other users find the information they need and make accurate decisions.

The tourism industry has a great opportunity to be promoted widely and developed online through websites. Most tourist destinations currently available make it easy for tourists to find accommodation and comfort during the holidays (E. Indrayuni, 2016). Hotels are a very important tourist product that deserves attention in terms of facilities, service quality, and distance to the hotel. Currently there are many e-traveling sites, such as wisatakita.com, pegipegi.com, booking.com, tripadvisor.co.id, traveloka.com, trivago.com, misteraladin.com, and so on, which let tourists write testimonials about their opinions and personal experiences online.

The e-traveling site studied in this research is traveloka.com, because Traveloka has operated as the first online travel site in Indonesia since March 2012. In addition, the number of hotel service users or tourists who use Traveloka is very large. There are 19,272 hotels in Indonesia promoted through traveloka.com, so this e-traveling site is the best-selling and most popular among domestic tourists. One problem that arises is that tourists or visitors must read all the testimonials in their entirety, which takes a long time. It was also found that the ratings or scores attached to testimonials were sometimes inconsistent with what the tourists or hotel service users actually wrote.

When tourists or hotel service users search for destinations and hotels, they usually look for online hotel testimonials in the destination city in order to make a booking decision. These testimonials are sometimes doubted by tourists or new users of hotel services, because it is very difficult to read and understand all of them in a short time.

This study looks at these limitations of hotel testimonials and analyzes their sentiment, in order to determine whether each testimonial is positive or negative and to rank the hotel testimonials. It applies a topic model approach, which uses generative techniques to model the topics contained in the testimonial documents. The topic model was built to fit the satisfaction categories used on e-traveling sites, such as cleanliness, comfort, food, location, and service. The corpus is built on expert knowledge and is used to analyze the sentiment of online hotel testimonials with a classification method.

Previous research on sentiment analysis includes the work of J. Samudra, S. Supeno, and M. Hariyadi (2009), who classified text as an alternative way of processing digital documents, so as to simplify and accelerate the search for needed information. The method used is Naïve
Bayes. Text documents are represented as a set of words, and each word in a document is considered independent of the others.

T.B. Adji, G.A. Buntoro, and A.E. Purnamasari (2014) analyzed community sentiment on social media issues, especially on Twitter, using a combination of Lexicon-based and Double Propagation methods. This produced 7 classes (very positive, positive, somewhat positive, neutral, somewhat negative, negative, and very negative) with an accuracy rate of 23.44%.

In addition, M. El-Din, H. Muktar, and Ismail (2015) conducted sentiment analysis to automatically detect subjective information such as emotions and feelings. Online testimonials on papers can serve as a reference source, saving time otherwise spent reading reviews.

The following sections describe the Naïve Bayes method, which is the basis of the proposed method, and how it uses the corpus that has been built, followed by a discussion of the research results and the conclusions.

II. LITERATURE REVIEW

Research related to sentiment analysis has been carried out by several researchers. The results are summarized in Table 2.1, which lists the authors and years, the research topics, and the theories developed in the existing research.

Table 2.1: Summary of Related Research
No | Author, Year | Research Topics | Theories That Have Been Developed
1 | Jo, Y., and Oh, Alice, 2011 | Aspect and Sentiment Unification Model for Online Review Analysis. Dataset: electronics and restaurants | Combination of S-LDA and ASUM methods.
2 | Boiy, E.; Hens, P.; Deschacht; Moens, M.F., 2007 | Automatic Sentiment Analysis in On-line Text. Dataset: film reviews | Tests of sentiment analysis techniques. SVM and Naive Bayes tools / applications.
3 | Rui Xie, Chunping Li, Qiang Ding, and Li Li, 2014 | Integrating Topic, Sentiment and Syntax for Modeling Online Review | Combining Part of Speech in the model. Tag Sentiment Aspect Models (TSA).
4 | Brob, J., 2013 | Aspect Oriented Sentiment Analysis of Customer Reviews Using Distant Supervision Techniques | Distant supervision technique to reduce human supervision in the annotation process. 91% accuracy in assigning labels / annotations to the corpus.
5 | Changlin Ma, Meng Wang, and Xuewen Chen, 2015 | Topic and Sentiment Unification Maximum Entropy Model for online review analysis. Object: restaurants | Sentiment classification with Maximum Entropy. Methodology with Topic Sentiment Unification Maximum Entropy-LDA (TSU Max-LDA).

III. METHODOLOGY

Indrayuni (2016) explained that tourism is an attraction that can be promoted and developed through online sites. The increasingly widespread e-traveling sites give tourists comfort and convenience in finding accommodation during their trips. In addition, hotel facilities are among the most important tourism products and must be considered in terms of services and facilities. Currently there are many e-traveling sites, such as GoIndonesia.com, Tiket.com, valadoo.com, traveloka.com, burufly.com, bobobobo.com, and others, which tourists can use when planning a tour.

The data sets used in this study were obtained from two different sources. The first is reviews from a popular e-traveling site; the second is articles about hospitality from several specific sites. Both data sets are used to build the corpus and to perform sentiment analysis. The corpus is built from 14,323 reviews of 625 hotels, together with 10 articles related to cleanliness, comfort, food, location, and hotel services, which serve as attributes or parameters.
Hotel reviews were acquired from popular e-traveling sites by automatic web scraping with the UBOT Studio application, without an API (application programming interface). After the hotel reviews had been scraped, the review text was processed by tokenizing, stop word removal, and stemming.

The same text processing was applied to the 10 articles related to cleanliness, comfort, food, location, and hotel services, which serve as attributes or parameters. The articles were taken from https://phinemo.com, www.swiss-belhotel.com, www.triadvisor.com, and https://perantara.net/.

The steps can be seen in the general overview of the study. This research aims to analyze online hotel reviews written by users of hotel services. Broadly speaking, the analysis process is presented in Figure 3.1.

Fig. 3.1. Overview of Research

3.1. Sentiment Analysis by Method of Classification

The sentiment analysis conducted in this research is a process of classifying textual documents from online hotel testimonials into two classes, positive and negative sentiment, using the Naïve Bayes classification method. The process begins with preprocessing, which consists of tokenization, stop word filtering, and stemming, and is followed by Naïve Bayes classification.

3.1.1. Sentiment Analysis by Multinomial Naïve Bayes.

The classification method used in this study is the Naïve Bayes Classifier, applied to online testimonial data from leading e-traveling sites. The Naïve Bayes Classifier has been developed to calculate a probabilistic measure for each word and to provide an assessment for each class. One such development is the Multinomial Naïve Bayes model of Manning et al. (2008). This method estimates the conditional probability of a token given a class as the relative frequency of the word t in the documents belonging to class c:

$$\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$

The Multinomial Naïve Bayes method takes into account the number of occurrences of the word t in the training documents of class c, including repeated occurrences. The training process with Multinomial Naïve Bayes can be seen in Algorithm_1.

Algorithm_1. Training document by multinomial naïve bayes

Input : Document D, Class C
Output : Vocabulary V, Prior Knowledge prior, Likelihood condprob
a) Extract vocabulary V from document D
b) Calculate N, the number of documents in D
c) For every c ∈ C
   Calculate Nc as the number of documents in D that have class c
   1. Calculate prior[c] = Nc / N
   2. Combine all text in the documents of D that have class c into text_c
   3. For every t ∈ V
      Calculate Tct as the number of occurrences of token t in text_c
   4. For every t ∈ V
      Calculate Likelihood condprob[t][c] = (Tct + 1) / Σ_{t' ∈ V} (Tct' + 1)

The testing phase, which uses the results of training, follows Algorithm_2.

Algorithm_2. Testing document by multinomial naïve bayes

Input : Class C, Vocabulary V, Prior Knowledge prior, Likelihood condprob, Test document d
Output : Class of test document d
a) Extract tokens W from test document d based on Vocabulary V
b) For each c ∈ C
   Calculate score[c] = log prior[c]
   For every t ∈ W
      Calculate score[c] += log condprob[t][c]
c) Select the class c with the highest score[c]
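The training and testing steps of Multinomial Naïve Bayes can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the add-one (Laplace) smoothed estimate from Manning et al. (2008), natural logarithms, and documents that have already been preprocessed into token lists. The function names and the toy Indonesian tokens are hypothetical.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels):
    """Train on docs (a list of token lists) with parallel class labels."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    prior = {}
    condprob = {t: {} for t in vocab}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / n_docs
        # Token counts over the concatenated text of class c.
        counts = Counter(t for d in class_docs for t in d)
        # Add-one smoothing (an assumption of this sketch).
        denom = sum(counts[t] + 1 for t in vocab)
        for t in vocab:
            condprob[t][c] = (counts[t] + 1) / denom
    return vocab, prior, condprob


def apply_multinomial_nb(vocab, prior, condprob, doc):
    """Return the best class and the log score of every class for doc."""
    known = set(vocab)
    tokens = [t for t in doc if t in known]  # ignore out-of-vocabulary tokens
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c])
        for t in tokens:
            scores[c] += math.log(condprob[t][c])
    best = max(scores, key=scores.get)
    return best, scores


# Toy training data (hypothetical, already tokenized and stemmed).
train_docs = [["bersih", "nyaman"], ["bersih", "bagus"], ["kotor", "bising"]]
train_labels = ["pos", "pos", "neg"]

vocab, prior, condprob = train_multinomial_nb(train_docs, train_labels)
predicted, scores = apply_multinomial_nb(
    vocab, prior, condprob, ["bersih", "nyaman", "mahal"]
)
print(predicted, scores)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which is why the scores are sums of logarithms rather than products.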
3.1.2. Sentiment Analysis by Multinomial NB+Corpus.

The performance of the Naïve Bayes Classifier can be improved by using the corpus data created and developed in the previous stage. The corpus is used to give more weight to the probability parameters of each token listed in it. The corpus covers the hotel topic parameters, namely comfort, cleanliness, hotel location, food, and friendly service.

Corpus weights are obtained from the probabilistic values of the occurrence of term t in the existing topics, with the goal of normalizing the weights. This study uses the proportion of tokens in each class c in the data set, p+ = 0.65 for the positive class and p- = 0.35 for the negative class. condprob is then calculated from these weights and proportions, and the score for each class c is obtained from the resulting values.

3.2. Calculation of Rating with scoring

The sentiment classification of online testimonials is used to improve the ratings given by tourists or hotel service users on e-traveling sites, since many scores were found not to match the content of the testimonial. To calculate a rating that fits the testimonial content, score[c] is obtained by combining Naïve Bayes with the corpus model.

The positive rating is obtained by multiplying the positive score[c+] by 5 and then adding 5 as the initial positive value. The negative rating is obtained in the same way but without adding 5. The formula used to calculate ratings is as follows,

$$\text{rating} = \begin{cases} 5 + 5 \times \text{score}[c^{+}] & \text{for positive sentiment} \\ 5 \times \text{score}[c^{-}] & \text{for negative sentiment} \end{cases}$$

For example, if the score obtained from an online testimonial is 0.75 with a positive sentiment, then the rating obtained is 5 + (0.75 x 5) = 8.75.

IV. DISCUSSION RESULT

According to Tan and Zhang (2008), data acquisition (DAQ) is the process of sampling real-world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer.

In this research, data acquisition goes through several stages: scraping, labeling, tokenizing, stop word filtering, and stemming. The scraping process obtains text data from popular e-traveling sites. Each review is then given to an expert, who labels its sentiment as positive or negative.

In addition to the sentiment label, each review is labeled as training data or test data. This is done randomly, without viewing the review content, in order to maintain data independence. The labeling process does not change the structure of the review content. To turn the reviews into a data set, preprocessing is performed, consisting of tokenizing, stop word filtering, and stemming.

Tokenizing is the process of extracting words from reviews, converting capital letters to lowercase, and removing punctuation. The result of tokenizing is filtered by eliminating words that are not important terms, using the stop word list developed by Tala (2004). The last stage of preprocessing is stemming.

The stemming process removes all affixes, whether prefixes, infixes, suffixes, or confixes (combinations of prefix and suffix), from each term found. Stemming reduces a word to its root form in accordance with correct Indonesian morphology. Stemming results were evaluated manually by direct observation, using the Big Indonesian Dictionary (KBBI) to judge whether each result was right or wrong. The data set consisted of 3,919 reviews as training data and 2,314 reviews as test data. Each record consists of a review field, sentiment (positive or negative), type of data set (training data or test data), rating, the result of tokenizing, the result of stop word filtering, and the result of stemming. The result of the stemming process is used for corpus development.

The Naïve Bayes method produces prior knowledge and likelihood values in the learning process. Instead of storing the prior knowledge value directly, the number of occurrences Nc of tokens in each class, positive and negative, can be stored. This is done to save storage, because prior knowledge is the ratio of token occurrences to the number of tokens in the whole data set, and the result is a floating-point number that requires more space than an integer.
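The preprocessing stages described in this section (tokenizing, stop word filtering, stemming) can be sketched in Python as follows. This is only an illustration under stated assumptions: the stop word set is a small hypothetical sample rather than the list of Tala (2004), and `toy_stem` is a crude stand-in that strips a few suffixes, whereas a real Indonesian stemmer also handles prefixes, infixes, and confixes and checks results against a dictionary.

```python
import re

# Hypothetical mini stop word list; the study uses the list by Tala (2004),
# which is not reproduced here.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "sangat"}

# Toy suffix list for illustration only.
SUFFIXES = ("nya", "lah", "kah")


def tokenize(review):
    """Lowercase the review and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", review.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def toy_stem(token):
    """Strip one known suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(review, stem=toy_stem):
    """Run tokenizing, stop word filtering, and stemming in order."""
    return [stem(t) for t in remove_stop_words(tokenize(review))]


tokens = preprocess("Lokasi hotel ini SANGAT strategis, kebersihannya bagus!")
print(tokens)
```

Passing the stemmer as a parameter keeps the pipeline unchanged if the toy function is later replaced by a proper Indonesian stemming algorithm.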
Likelihood values are generated for each token in each class, positive and negative. An example of the learning outcomes on the training data with the Multinomial Naïve Bayes method can be seen in table_1.

Table_1. Outcomes of Learning by Naïve Bayes
No. | Token | Nc Pos | Nc Neg | Likelihood Positive | Likelihood Negative
1 | Alat | 37 | 9 | -6.70895 | -6.81991
2 | makanan | 25 | 4 | -7.08644 | -7.51325
3 | Ganti | 40 | 18 | -6.63397 | -6.17835
4 | kompor | 8 | 1 | -8.14831 | -8.42935
5 | lokasi | 217 | 20 | -4.96224 | -6.07787
6 | kamar | 11 | 1 | -7.86193 | -8.42985
7 | lorong | 4 | 0 | -8.73720 | -9.12289
8 | lewat | 3 | 0 | -8.96044 | -9.12239
9 | tingkat | 95 | 7 | -5.78229 | -7.04365
10 | kebersihannya | 1 | 0 | -9.65329 | -9.12269
etc. | … | … | … | … | …

As with the original Naïve Bayes method, the learning process of the proposed method (Naive Bayes + Corpus) produces prior knowledge and likelihood values. Compared with the original Naïve Bayes method, the Naive Bayes + Corpus method produces likelihood values with a relatively larger distance between the positive and negative values of a token. For example, the "holiday" token has a positive likelihood of -5.64 and a negative likelihood of -7.71 in the original method, whereas in the proposed method it has a positive likelihood of -5.64 and a negative likelihood of -36.80. This wide gap is caused by the corpus weight of 0.365 assigned to the "holiday" token. The results of learning on the training data with the Naive Bayes + Corpus method can be seen in table_2.

Table_2. Result of Learning by Naïve Bayes + Corpus
No. | Token | Nc Pos | Nc Neg | Likelihood Positive | Likelihood Negative
1 | Alat | 37 | 9 | -6.70895 | -6.81891
2 | makanan | 25 | 4 | -7.06644 | -7.51325
3 | ganti | 40 | 18 | -0.74403 | -6.17835
4 | kompor | 8 | 1 | -8.14631 | -8.42955
5 | lokasi | 217 | 20 | -4.98204 | -6.07787
6 | kamar | 11 | 1 | -7.86163 | -8.42985
7 | lorong | 4 | 0 | -8.78710 | -8.75882
8 | lewat | 3 | 0 | -8.99024 | -9.12239
9 | tingkat | 95 | 7 | -5.76219 | -10.5768
10 | kebersihan | 1 | 0 | -9.65439 | -9.12269
etc. | … | … | … | … | …

The proposed Naive Bayes + Corpus method performs better than the original Naïve Bayes method. The results show an increase in accuracy of 0.03, or 3 percentage points: the original method has an accuracy of 0.86, while the Naive Bayes + Corpus method has an accuracy of 0.89. Performance calculations for both methods can be seen in table_3.

Table_3. Parameters Performances
Parameters Performance | Naïve Bayes | Naïve Bayes + Corpus
Accuracy | 0.86 | 0.89
Error Rate | 0.15 | 0.11
Precision | 0.91 | 0.99
Negative Predictive Value | 0.40 | 0.04
Recall | 0.92 | 0.89
Specificity | 0.38 | 0.51

V. CONCLUSION

The tests of the corpus development method and its use in sentiment classification with Naïve Bayes show that the approach answers the problem formulation and meets the purpose of this study. The method can be used to analyze sentiment and give a rating by modifying the likelihood variables in the Naïve Bayes method, multiplying them by the corpus weight and by the proportion of positive and negative training data. With the Naive Bayes + Corpus method, accuracy increased from 0.86 to 0.89.

REFERENCES

[1] A. Josi, L.A. Abdillah, Suryayusra, Penerapan teknik web scraping pada mesin pencari artikel ilmiah., Prosiding, 2
[2] Alexander Pak, Patrick Paroubek., Twitter as a Corpus for Sentiment Analysis and Opinion Mining., Journal ACM, 2010.
[3] Boiy, E.; Hens, P., Deschacht, K., Moens, M.F., Automatic Sentiment Analysis in On-line Text. Proceeding, 2007.
[4] Brob, J., Aspect Oriented Sentiment Analysis of Customer Reviews Using Distant Supervision Techniques, Dissertation, 2013.
[5] Bin Lu, Myle Ott, Claire Cardie, and Benjamin K. Tsou., Multi-aspect Sentiment Analysis with Topic Models, Journal IEEE, 2011.
[6] Card S.K., Mackinlay J.D. Shneiderman B. (eds), 2009, Reading in Information Visualization, Using Vision to Think, San Francisco: Morgan Kaufmann
[7] Changlin Ma, Meng Wang, and Xuewen Chen., Topic and Sentiment Unification Maximum Entropy Model for Online Review Analysis. Journal ACM, 2015.
[8] Dumbill, E. 2012. Big Data Now Current Perspective. O'Reilly Media.
[9] Eaton, C., Dirk, D., Tom, D., George, L., & Paul, Z. (n.d.). 2012. Understanding Big Data. Mc Graw Hill.
[10] Ghulam Asrofi Buntoro, Teguh Bharata Adji, and Adhistya Erna P., Twitter Sentiment Analysis with Combination Lexicon Based and Double Propagation. Proceeding, 2014.
[11] H. Tang, S., Tan, X. Cheng., A survey on sentiment detection
of reviews, Expert Systems with Applications 36(7). Journal,
2009.
[12] Haddi, E., Liu, X., and Shi, Y. The Role of Text Pre-processing in Sentiment Analysis. Procedia Computer Science., Proceeding, 2013.
[13] Ivan Titov, Ryan McDonald., Modeling Online Reviews
with Multi-Grain Topic Models, 2008.
[14] Jo, Y., Oh, Alice, Aspect and Sentiment Unification Model
for Online Review Analysis. Journal ACM, 2011.
[15] Liu, Y., Wang, G., Chen, H., Dong, H., Zhu, X., & Wang,
S., An Improved Particle Swarm Optimization for Feature
Selection. Journal of Bionic Engineering, 8(2), 191– 200.
doi:10.1016/S1672-6529(11)60020-6. 2011.
[16] Ledy Agusta., Perbandingan Algoritma Stemming Porter
dengan Algoritma Nazief & Adriani untuk Stemming
Dokumen Teks Bahasa Indonesia., Proseding, 2009.
[17] Lyu Kigon, Kim Hyeoncheol., Sentiment Analysis Using
Word Polarity of Social Media., Springer Science+Business
Media, New York, McGinty,L,Smyth,B, “Adaptive
selection :analysis, 2016.
[18] Manning, C.D., Raghavan, P., and Schutze, H.,
Introduction to Information Retrieval. Cambridge
University. 2008.
[19] Medhat, W., Hassan, A., and Korashy, H., Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal., 2014.
[20] Miftah Ardiansyah, Annotated corpus-based topic models
for client analysis system supporting consular dialogue.
Dissertation, 2015.
[21] Neil O’Hare, Michael Davy, A. Bermingham, Paul F.,
Páraic Sheridan, Cathal Gurrin, Alan F. Smeaton., Topic
Dependent Sentiment Analysis of Financial Blogs. Journal
ACM, 2009.
[22] Rui Xie, Chunping Li, Qiang Ding, and Li Li., Integrating Topic, Sentiment and Syntax for Modeling Online Review. Journal, 2014.
[23] Raja Mohana S.P, Umamaheswari K, and Karthiga R.,
Sentiment Classification based on Latent Dirichlet
Allocation., Journal, 2015.
[24] Sianipar Raisa and Budi Erwin Setiawan., Strength
detection sentiment on Indonesian Language Tweet Text
Using Sentistrength. Journal, 2015.
[25] V. Jijkoun, M. de Rijke, and W. Weerkamp. Generating focused topic-specific sentiment lexicons. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 585–594, Uppsala, Sweden, July 2010. Association for Computational Linguistics.
[26] Zhang, Ye, Q., Zhang, Z., & Li, Y. Sentiment classification
of Internet restaurant reviews written in Cantonese. Expert
Systems with Applications, 38, 7674–7682., Journal, 2011.
[27] Zhang, Harry., The Optimality of Naive Bayes., American
Association for Artificial Intelligence (www.aaai.org).
Journal, 2004.
