Coastal Sentiment Review Using Naïve Bayes With Feature Selection Genetic Algorithm
Abstract.
Purpose: Maritime tourism is currently one of Indonesia's mainstays, particularly coastal tourism, since
Indonesia is an archipelagic country known for the natural beauty of its coasts. The purpose of this
study is to find the best Naïve Bayes (NB) model and the best features by applying feature selection with a
genetic algorithm (GA) and Information Gain (IG), in order to achieve the highest sentiment classification accuracy.
Methods: The research proceeded through data collection, pre-processing, analysis of the research data using
the Naïve Bayes model with genetic algorithm optimization, data validation, and model evaluation.
Results: The experimental results show that the best model is Naïve Bayes with information gain and genetic
algorithm feature selection, which yields an accuracy of 86.34%.
Novelty: The main contribution of this research is an optimized NB model that applies an optimization
algorithm to feature selection in order to increase sentiment classification accuracy.
Keywords: Coastal, Naïve Bayes, Information gain, Feature selection, Genetic algorithm
Received April 2023 / Revised April 2023 / Accepted May 2023
This work is licensed under a Creative Commons Attribution 4.0 International License.
INTRODUCTION
Indonesia is an archipelagic country known around the world for its large potential in the maritime
sector. Coastal tourism has become one of the sector's main attractions: both local and foreign tourists
visit to enjoy the natural beauty of Indonesia's coasts. How people assess a coastal destination strongly
influences how many others are interested in following those who have already seen or enjoyed the
beach [1], [2]. Social media is often used by visitors to assess the beauty of beaches [3], [4]. Social
media currently has a greater influence than other media such as newspapers and other print media [5],
[6].
Social media presents a picture of something regardless of whether the information is true. It strongly
influences people's lives and their decisions to act [7]. Reviews of a place are now commonly shared
through social media so that others can learn about it, whether the review is positive or negative [8].
The same holds for coastal locations, especially on the southern coast of Java, which many people
review. These widely shared sentiment reviews indirectly have a strong influence on the potential of
coastal maritime tourism.
Sentiment analysis (SA) is a field that uses artificial intelligence to support decisions by categorizing
sentiment as positive, negative, or neutral [9]. SA is often carried out using machine learning, where each
algorithm produces a different level of accuracy according to its strengths and weaknesses. Sentiment
analysis has been used in various fields [10], [11] in the hope that it can provide decision support for
policy making. Some examples of research in the field of sentiment analysis are
* Corresponding author.
Email addresses: [email protected] (Somantri)
DOI: 10.15294/sji.v10i3.43988
Among the algorithms often used for sentiment analysis, especially classification, are the neural network
algorithm and the Naïve Bayes (NB) algorithm. Neural networks have the advantage of learning from the
prior experience of the model and can perform calculations in parallel [21], [22]. Naïve Bayes is typically
applied when the training data are small-scale, and it has been widely used for data processing, especially
text mining, because it achieves good accuracy [23], [24]. Based on these advantages, NB is applied to
sentiment reviews of coastal destinations in the hope of producing a model with high accuracy.
The problem with this algorithm is that some parameter values must still be set manually, which makes it
difficult to obtain the best model and the best feature weights. In addition, the initial weight values in
the model are not yet optimal and therefore require optimization. The purpose of this study is to find the
best model of the two algorithms applied by using an optimization algorithm, and to obtain the weight and
parameter values that produce the highest sentiment classification accuracy.
Yan et al. [25] used social media data to assist planning for the recovery of tourist destinations,
especially in the Lombok and Bali areas after the 2018 disaster. Tourism in the Bali and Lombok regions, and
the public's view of these destinations after the disaster, relies on beach tourism. The study proposed the
Latent Dirichlet Allocation (LDA) method with data from Twitter, and the results show that the approach
can effectively reveal the various sentiments and community perspectives on post-disaster tourism
recovery over time.
Park et al. [26] empirically tested the effect of news on predicting tourist arrivals. In this study, topics
extracted from news sources were used as data for forecasting tourist arrivals in Hong Kong. The proposed
method is the Autoregressive Integrated Moving Average (ARIMA) method, preceded by feature selection to
choose the variables. The proposed model helps tourist destinations deal with the externalities of news
media coverage that affect people's sentiment toward a destination.
Another study related to destination sentiment was conducted by Ali and Twiland [27] to find the best model
of tourist experience sentiment in Morocco. The study proposes a model that combines topic modeling, using
Latent Dirichlet Allocation (LDA), with lexicon-based algorithms, with data drawn from TripAdvisor reviews
of various tourist attractions in Marrakech, Morocco. The next study, conducted by Sohrabi [28], is
slightly different and proposed a model for
In the previous studies reviewed above, the researchers conducted experiments using an algorithmic model
without optimizing it, so the resulting accuracy was not optimal. A further limitation is the source of the
research datasets, which consist of reviews of tourist attractions. They use social media data, which
affects data pre-processing and leads to different levels of accuracy. For this reason, one way to increase
the accuracy of a sentiment review model for the coast as a tourist attraction is to propose a new model
using Naïve Bayes (NB). This article proposes Naïve Bayes, with feature selection using a genetic
algorithm, for classifying sentiment reviews of destinations on the southern coast of Java Island, as a
recommendation for increasing maritime tourism visits, especially beach tourism. The main contribution of
this study is the application of an optimization algorithm for feature selection in the NB model to
increase the accuracy of sentiment review classification.
METHODS
Dataset
This study uses an experimental research method, in which the data are processed and input into the model
to obtain the model with the best accuracy. The data were taken from the website
https://www.google.com/maps by entering the keyword "beach"; the search results contain reviews of places
matching the desired beach name, each with a star rating. Ratings range from 1 to 5, where 5 is the most
positive sentiment. The data are Indonesian-language texts posted from 2018 to 2021, processed into a
dataset of 390 records. Examples of the data are shown in Table 1.
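• [Pantai yg ramah buat ciblonan.. walau agak kotor tapi masih tetap indah. Kita juga bisa menyebrang
  naik perahu menuju pantai pasir putih, banyak batu2 kecil yg indah berwarna warni.]
  (English: A pleasant beach for playing in the water.. although a bit dirty, it is still beautiful. We can
  also cross by boat to the white-sand beach; there are many beautiful, colorful little stones.)
• [Pantai nya kotor banget, beli mendoan seporsi isi 8 biji 28.000, wkwk padahal di pantai sodong
  mendoan seporsi isi 8 cuma 15.000.]
  (English: The beach is really dirty; a portion of 8 mendoan cost 28,000, lol, whereas at Sodong beach a
  portion of 8 is only 15,000.)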
Labels in this study were determined from the star rating. A review is assigned negative sentiment if it
has 1-3 stars and positive sentiment if it is rated 5 stars. The sentiment classes in the model are limited
to positive and negative.
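The labeling rule above can be sketched as a small function. This is only an illustration: the function name and the sample ratings are hypothetical, and leaving 4-star reviews unlabeled is an assumption, since the text does not say how they are handled.

```python
from typing import Optional


def label_from_stars(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a sentiment label.

    Rule from the text: 1-3 stars -> negative, 5 stars -> positive.
    Assumption: 4-star reviews return None (unlabeled), because the
    text does not state how they are treated.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= 3:
        return "negative"
    if stars == 5:
        return "positive"
    return None


# Hypothetical (review id, stars) pairs, for illustration only.
ratings = [("review-a", 1), ("review-b", 5), ("review-c", 4), ("review-d", 3)]
labeled = {rid: label_from_stars(s) for rid, s in ratings}
print(labeled)
```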
Preprocessing Data
After the dataset of coastal review texts has been obtained, the next step is data preprocessing. The
preprocessing stages used to obtain the expected text data are data cleansing, tokenization, stemming, and
data filtering [29], [30]. The preprocessed data are then input into the predetermined model, namely
Naïve Bayes.
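A minimal sketch of the four preprocessing stages is given below. It is not the pipeline used in the study: the stop-word list is a tiny hypothetical sample, and the stemmer is a placeholder that strips only the suffix "nya", whereas a real Indonesian stemmer would handle many more affixes.

```python
import re

# Hypothetical, very small stop-word list used only for illustration.
STOPWORDS = {"yg", "yang", "di", "dan", "nya", "juga", "tapi"}


def clean(text: str) -> str:
    """Data cleansing: lowercase and replace anything except a-z and whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text: str) -> list:
    """Tokenization: split on whitespace."""
    return text.split()


def stem(token: str) -> str:
    """Placeholder stemmer (assumption): strip a trailing 'nya' from longer words."""
    if token.endswith("nya") and len(token) > 5:
        return token[:-3]
    return token


def filter_tokens(tokens: list) -> list:
    """Data filtering: drop stop words and tokens shorter than three letters."""
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def preprocess(text: str) -> list:
    return filter_tokens([stem(t) for t in tokenize(clean(text))])


print(preprocess("Pantai nya kotor banget, tapi masih tetap indah."))
```

The resulting token lists are what a bag-of-words Naïve Bayes model would consume as features.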
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \qquad (1)$$
In equation (1), X is the data with unknown class, H is the hypothesis about data X, P(H|X) is the
probability of hypothesis H given X, P(H) is the prior probability of hypothesis H, P(X|H) is the
probability of X given hypothesis H, and P(X) is the probability of X.
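As a small worked example of Bayes' rule with the terms defined above, the sketch below computes the posterior probability of the positive class given that a review contains a particular word. All probabilities are hypothetical values chosen for illustration; they are not estimates from the study's data.

```python
def posterior(prior_h: float, likelihood_x_given_h: float, evidence_x: float) -> float:
    """Bayes' rule: P(H|X) = P(X|H) * P(H) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical values: H = "review is positive", X = "review contains a given word".
p_pos = 0.6               # P(H)
p_word_given_pos = 0.05   # P(X|H)
p_word_given_neg = 0.20   # P(X|not H)

# Law of total probability gives the evidence P(X).
p_word = p_word_given_pos * p_pos + p_word_given_neg * (1 - p_pos)

p_pos_given_word = posterior(p_pos, p_word_given_pos, p_word)
print(f"P(positive | word) is about {p_pos_given_word:.2f}")
```

With these numbers the word is more typical of negative reviews, so observing it lowers the probability of the positive class from the prior of 0.6 to roughly 0.27.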
To measure the performance of the sentiment classification, this study uses equation (2) [35], [36].
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$
In equation (2), TP is the number of True Positives, TN the number of True Negatives, FP the number of
False Positives, and FN the number of False Negatives.
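The four counts in equation (2) can be derived from paired true and predicted labels, as in the sketch below. The label lists are hypothetical toy data, and the function names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive="positive"):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (2): (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples")
    return (tp + tn) / total


# Hypothetical toy labels, not data from the study.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(*counts))
```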
As shown in Table 2, the highest accuracy obtained was 67.44%, using the parameter Fold = 6 with the
stratified sampling method. These results still need improvement, so at a later stage another model was
implemented for optimization.
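Stratified sampling with Fold = 6 means the data are split into six folds that each keep roughly the overall class proportions; each fold is held out once for testing while the rest is used for training. The sketch below is one simple way to build such folds and is not the tool used in the study. The class sizes (250 positive, 140 negative) are taken from the column totals of the confusion matrix reported in Table 6; the function name and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Return k lists of sample indices, dealing each class round-robin over the folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize which samples land in which fold
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


labels = ["positive"] * 250 + ["negative"] * 140
folds = stratified_folds(labels, k=6)

for number, fold in enumerate(folds, start=1):
    positives = sum(1 for i in fold if labels[i] == "positive")
    print(f"fold {number}: {len(fold)} samples, {positives} positive")
```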
[Figure: bar chart of accuracy (%) by model; the plotted values are 67.44% and 80.91%.]
To compare feature selection optimization settings, another experiment was carried out with different
parameters, namely population = 5 and selection scheme = roulette wheel, which produced the results
shown in Table 5.
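In a genetic algorithm for feature selection, each individual is typically a bit mask that marks which features are kept, and roulette wheel selection picks parents with probability proportional to fitness. The sketch below illustrates that selection scheme only. The population size of 5 follows the text; the masks and fitness values are hypothetical, and in practice the fitness would be something like the cross-validated accuracy of NB on the selected features.

```python
import random


def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off


# Five hypothetical feature masks (1 = feature kept) with hypothetical fitness values.
population = [
    (1, 0, 1, 1, 0, 0),
    (0, 1, 1, 0, 1, 0),
    (1, 1, 0, 0, 0, 1),
    (0, 0, 1, 1, 1, 1),
    (1, 1, 1, 0, 0, 0),
]
fitnesses = [0.81, 0.78, 0.70, 0.85, 0.74]

rng = random.Random(42)
parents = [roulette_wheel_select(population, fitnesses, rng) for _ in range(2)]
print(parents)
```

Because selection is proportional rather than winner-takes-all, weaker masks still have some chance of becoming parents, which helps keep the search diverse.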
The best result for optimizing the NB method with genetic-algorithm-based feature selection is an accuracy
of 86.34%, with a micro-average accuracy of 86.41%. This is an increase of 5.43 percentage points over the
previous 80.91%. The experimental results are shown in Table 6; applying formula (2) to them gives the
micro-average accuracy of 86.41%.
Table 6. Confusion matrix result
                      True Negative   True Positive   Class Precision
Prediction Negative   100             13              88.50%
Prediction Positive   40              237             85.56%
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{337}{390} = 0.8641$$
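The figures in Table 6 can be checked with a few lines of arithmetic. The sketch below uses only the four counts reported in the table (rows are predicted classes, columns are true classes); the variable names are illustrative.

```python
# Counts as reported in Table 6.
tn = 100  # predicted negative, truly negative
fn = 13   # predicted negative, truly positive
fp = 40   # predicted positive, truly negative
tp = 237  # predicted positive, truly positive

total = tp + tn + fp + fn
accuracy = (tp + tn) / total          # formula (2)
precision_negative = tn / (tn + fn)   # class precision of the "Prediction Negative" row
precision_positive = tp / (tp + fp)   # class precision of the "Prediction Positive" row

print(f"total samples      = {total}")
print(f"accuracy           = {accuracy:.4f}")
print(f"precision negative = {precision_negative:.2%}")
print(f"precision positive = {precision_positive:.2%}")
```

The results agree with the values reported in the table and in the accuracy computation: 390 samples, an accuracy of 0.8641, and class precisions of 88.50% and 85.56%.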
To evaluate the performance of the model obtained, the AUC (Area Under the Curve) value is applied [40].
The best model has an AUC of 0.79, which falls under the "fair classification" criterion. The AUC
categories are shown in Table 7.
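One way to read the AUC is as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition by comparing all positive-negative pairs; the scores and labels are hypothetical toy values, not outputs of the study's model.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores (higher = more likely positive) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

print(f"AUC = {auc(scores, labels):.3f}")
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. Unlike accuracy, this measure does not depend on a single decision threshold, which is why a model can have high accuracy and only a "fair" AUC.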
Although the proposed feature-selection-based Naïve Bayes model is currently in the "fair" category, it
still produces a fairly high accuracy of 86.34%. Future work is nevertheless needed to improve the model,
especially its AUC value.
Evaluation Model
The experiments show different levels of accuracy. This indicates that the accuracy obtained does not
depend only on the choice of model or algorithm; optimization is still required when the initial model is
not sufficient. Changes to the parameter values of each model strongly influence accuracy, so
considerable effort is needed to find parameter values that give the expected results.
Based on the experimental results, the model was evaluated by comparing the results obtained using classic
Naïve Bayes (NB), NB with Information Gain, and NB with Information Gain optimized using a genetic
algorithm (GA). The results of the model evaluation are shown in Figure 5 and Table 8.
[Figure 5: bar chart of accuracy (%) by model; the plotted values are 67.44%, 80.91%, 85.64%, and 86.34%.]
Based on Table 8, comparing the models obtained shows that the highest accuracy is 86.34%, achieved by the
NB_IG algorithm with GA-based feature selection using selection scheme = tournament, folds = 9, and
population = 5. The proposed model therefore has higher accuracy than the other models.
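The NB_IG variant relies on information gain to score features. As a minimal sketch of that idea, the code below computes the information gain of a binary "term present" feature with respect to the sentiment label. The toy data and function names are hypothetical, and the study's tool may compute or normalize the weights differently.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over v of P(feature = v) * H(labels | feature = v)."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


# Hypothetical toy data: whether a term occurs in each review, and the review's label.
labels = ["negative", "negative", "positive", "positive"]
term_a = [1, 1, 0, 0]  # occurs only in negative reviews
term_b = [1, 0, 1, 0]  # occurs equally in both classes

print(information_gain(term_a, labels))
print(information_gain(term_b, labels))
```

A term that separates the classes perfectly, like `term_a`, gets the maximum gain of 1 bit, while an uninformative term like `term_b` gets 0; features with low gain are natural candidates to drop before or during the GA search.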
CONCLUSION
A sentiment review model for the coast using the Naïve Bayes algorithm has been obtained, with the highest
accuracy after optimization being 86.34%. The model can be used by policymakers and other related parties
to improve coastal maritime tourism, whether services, facilities, or other aspects that can be optimized.
The current accuracy can still be improved, so further research is needed. Further research is recommended
to optimize from various angles, such as data pre-processing, selecting the best parameters, and optimizing
weight values. In addition, other algorithms should be applied to obtain further experimental results so
that the best model can be identified and applied.
ACKNOWLEDGEMENT
Thank you to the Ministry of Education and Culture through the Academic Directorate of Vocational
Education, and the Directorate General of Vocational Education, for funding this research through the
Beginner Lecturer Research scheme for the 2022 implementation year.
REFERENCES
[1] V. Teles da Mota, C. Pickering, and A. Chauvenet, “Popularity of Australian beaches: Insights from
social media images for coastal management,” Ocean Coast. Manag., vol. 217, p. 106018, Feb.
2022, doi: 10.1016/j.ocecoaman.2021.106018.
[2] I. I. Ibrahim, “Social Media Analysis in Building Customer Trust - Systematic Literature Review,”
in Proceedings of 2022 International Conference on Information Management and Technology,
ICIMTech 2022, Aug. 2022, pp. 39–44, doi: 10.1109/ICIMTech55957.2022.9915099.
[3] M. T. Cuomo, I. Colosimo, L. R. Celsi, R. Ferulano, G. Festa, and M. La Rocca, “Enhacing