A Machine Learning Approach to Building a Tourism Recommendation System using Sentiment Analysis

Article in International Journal of Computer Applications · June 2019

DOI: 10.5120/ijca2019919031

International Journal of Computer Applications (0975 – 8887)
Volume 178 – No. 19, June 2019

Abhishek Kulkarni
Department of Information Technology, NBN Sinhgad School of Engineering, Pune, India

R. M. Samant
Assistant Professor, Department of Information Technology, NBN Sinhgad School of Engineering, Pune, India

Prathamesh Barve
Department of Information Technology, NBN Sinhgad School of Engineering, Pune, India

Aarushi Phade
Department of Information Technology, NBN Sinhgad School of Engineering, Pune, India

ABSTRACT
Opinions have become extremely vital in today’s “ratings”-driven technological services. An Android application, a top-tier restaurant or any service, for that matter, thrives or wanes on the reviews it gets. A good review can help attract potential users while a bad one may drive them away. Thus, it is essential to analyze these reviews to better understand the user’s experience and work towards improving it. The general system that most services use today is based on star ratings or a score out of 5 or 10. Although these serve the most basic purpose, text-based reviews allow one to understand the reason behind the ratings and give both the user and the service provider more insight. It is impractical for a human to go through thousands of reviews and comprehend the users’ sentiment. Instead, training an algorithm to do this job is much more pragmatic, and advances in machine learning allow one to do so. This is where sentiment analysis comes in. In this paper, machine learning algorithms such as Multinomial Naïve Bayes, Random Forest Classifier and Bernoulli Naïve Bayes are analyzed and their behavior is studied. In addition, Convolutional Neural Networks and Recurrent Neural Networks are studied to find out whether deep learning algorithms perform better. Using these results, a recommendation system is built that maps an individual user’s interests to the highest-rated tourist places and generates a unique tour plan tailored to the user’s needs.

Keywords
Machine Learning, Sentiment Analysis, Tourism, Recommendation System, Recurrent Neural Network

1. INTRODUCTION
The tourism industry has exploded in recent years. This explosion has led to the industry becoming more dynamic and user driven. With the advent of new technology, it becomes imperative to entwine the two and create better, more efficient solutions to these dynamic problems. As with any technology, its design revolves around the user. The user’s capabilities and proficiency become a major factor which decides the complexity and usefulness of any technology. Various factors like businesses coming online, the increase in the quality of global positioning systems and the popularity of social media have led to the tourism industry becoming multifaceted. In the 20th and early 21st century, planning a trip took immense effort on the user’s part. This included contacting every hotel for booking information, arranging travel and deciding on various places to visit. The World Wide Web completely changed the scenario. Tourists came online and wrote reviews of places, businesses came online, and it became easier to book travel and accommodation. These developments made things easier for the user. Still, users must manually go through reviews and decide on the best resources among the hundreds available.

The system proposed in this paper eliminates this effort. It analyses various reviews of tourist places and creates a recommendation list. After taking the interests of the user, the system creates a tailored tour plan for the user. The user can search for information on various tourist places as well as explore resources relevant to her/him.

To generate this recommendation list, various machine learning and deep learning algorithms have been tried out. These include Bernoulli Naïve Bayes, Multinomial Naïve Bayes, Random Forest Classifier, Recurrent Neural Networks and Convolutional Neural Networks. The basic concept used for this work is Sentiment Analysis. Using existing reviews, the model is trained to identify to what extent a review is positive or negative. This positivity or negativity decides the rating of the tourist place in the system. Further, the recommendation list considers both the rating and the user’s interests to find the best trade-off between the two.

2. LITERATURE SURVEY
In [1], an outline of intelligent tourism systems is given. Different modules of an intelligent tourism system, such as the place recommender system, database thinking and information delivery for tourism, are described, and their advantages and limitations are discussed. It also looks at two leading recommender system technologies: Tripplehop's TripMatcher and VacationCoach's Expert Advise Platform, MePrint.

In [2], hybridization of collaborative filtering and content-based recommendation is studied. The authors use the IMDB dataset, with a set of 13 features, to recommend movies. Optimal feature weights are considered, a regression framework is described, and the performance of the framework is examined.

In [3], multiple content-based recommendation models, such as the TFIDF profile model and the BM25 profile model, are presented and evaluated. To investigate the performance of the approaches, the recommenders are compared using two different datasets obtained from the Delicious and Last.fm social systems.

In [4], a ranking system is proposed that recommends the products offering the best value for the consumer's money. It uses a novel dataset of US hotel reservations. Based on estimates from the model, the economic impact of various service and location characteristics of hotels is inferred.

In [5], the contrasts and subtleties between two distinct approaches to
text classification, namely the multivariate Bernoulli model and the multinomial model, are described. It reports that the multivariate Bernoulli model performs well at small vocabulary sizes, while the multinomial model performs better at large vocabulary sizes. The performance of Multinomial Naïve Bayes can be improved by using locally weighted learning [7].

In [6], various approaches to creating a recommendation list for tourism are examined. It states that, by using content-based scoring, a system can draw on typical tourist media information to bring score-based content and its semantics into the general inference process.

In [8], ensemble classification is used to perform sentiment analysis on a Twitter dataset. Ensemble classification combines the effect of several independent classifiers on a particular problem, and it outperforms traditional machine learning classifiers by 3-5%.

In [9], sentiment classification is implemented on the Amazon Fine Food Reviews dataset and the Yelp challenge dataset. Barry compares two approaches: first, a conventional Bag of Words approach using Multinomial Naïve Bayes and Support Vector Machine classifiers; second, a Long Short-Term Memory (LSTM) Recurrent Neural Network with GloVe embeddings and self-learned Word2Vec embeddings. The paper concludes that the LSTM is the best performing algorithm.

3. PROPOSED METHODOLOGY
3.1 Overview of the System
The proposed system aims to reduce the effort on the user’s part. The system will create a recommendation list which is curated using the results of the analysis of numerous user reviews and the inputs given by the user. The deep learning algorithm will determine the extent to which the review of a place is positive or negative. Based on the result, the rank of the place in the recommendation list will be decided. A more positive result will rank the place higher and increase the chances of it being recommended, while a more negative result will rank the place lower, thereby decreasing the likelihood of it being recommended. Each of these places is categorized based on what it offers; for instance, the Taj Mahal, being a historical site, offers historical and heritage value. A user will enter their preferences. These include the type of location they want to visit (adventurous, historical, architectural, etc.), the number of people traveling and children (if any), and the number of days they plan to take the trip for. Based on these parameters and the user reviews for each place, the recommendation list will be generated uniquely for that user. It will be mapped to the individual user’s requirements and a tailored trip will be generated. Thus, the user won’t have to settle for the generalized plans that tour businesses generally offer.

This system works in two phases. In the first phase, the reviews and other relevant data are gathered, and an average rating is assigned to each place. In the second phase, the ratings assigned in the previous phase and other parameters taken from the user are utilized to generate a unique recommendation list. Thus, every user gets a tailored tour plan that actually considers their opinions.

3.2 Design of the Recommendation List
The crux of the system is the recommendation list, which maps the user’s interests to ratings analyzed from reviews. Ratings, ambience, cleanliness, must-visit status, nightlife, parking, peacefulness and child safety are the factors considered while generating the recommendation list. For features which a user would generally want, a full score is given. The following formula has been devised for this:

Score(place=X) = 10*ambience + 10*cleanliness + 10*must-visit
               + nightlife*NightlifeUserValue
               + parking*ParkingUserValue
               + peacefulness*PeacefulnessUserValue
               + childSafety*ChildSafetyUserValue
               + 10*ratings

Here, the maximum value of each feature is 5; assuming each user-supplied value is at most 10, the eight terms give a maximum total score of 400 for any place. The features ambience, cleanliness, must-visit status and ratings are preferred by most users, so their weight is assumed to be the maximum. For the other features, the user’s inputs determine the weight. Thus, the summation of these values results in a holistic score for each tourist place, and arranging these scores in descending order generates the recommendation list.

3.3 Getting Data
Deep learning algorithms used for sentiment analysis require a vast dataset. For this reason, the Amazon Product Reviews dataset from Kaggle, which has 3.6 million reviews, has been used. From this dataset, 1 million reviews are taken: 600,131 positive reviews and the remaining 399,869 negative reviews. Along with these, reviews of different places have been gathered using the Google API. When one searches for a place on Google, the Google API returns information about that place along with the 5 most recent reviews of it. Additionally, publicly accessible reviews on sites like TripAdvisor and Google are taken to build the dataset. Likewise, a survey was conducted to collect reviews of various places. Using these sources, a total of 30,000 reviews of different places were accumulated. These reviews were manually labelled as positive or negative.

From the above dataset, 1 million reviews from the Amazon Product Reviews dataset and 20,000 reviews of places are used as the training set, while the remaining 10,000 reviews of places are used for testing.

As a model for the recommendation system, the state of Goa in India is considered. 26 places from Goa are chosen. Values for features like ambience, cleanliness and peacefulness of each place are assigned manually by reading reviews.

3.4 Model Building
To find the best performing models, the following machine learning and deep learning algorithms were considered and implemented on the Amazon Reviews dataset:

I) Bernoulli Naïve-Bayes
In the multivariate Bernoulli event model, features are independent Booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features are used rather than term frequencies. If $x_i$ is a Boolean expressing the occurrence or absence of the $i$th term from the vocabulary, then the likelihood of a document given a class $C_k$ is given by

$$p(x \mid C_k) = \prod_{i=1}^{n} p_{ki}^{x_i} \, (1 - p_{ki})^{(1 - x_i)}$$

where $p_{ki}$ is the probability of class $C_k$ generating the term $x_i$. This event model is especially popular for classifying short texts. It has the benefit of explicitly modelling the absence of terms. When implemented on the Amazon dataset, it had an accuracy of 82.75% and an f1-score of 0.83.

II) Multinomial Naïve-Bayes
With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been
generated by a multinomial $(p_1, \dots, p_n)$, where $p_i$ is the probability that event $i$ occurs (or $K$ such multinomials in the multiclass case). A feature vector $x = (x_1, \dots, x_n)$ is then a histogram, with $x_i$ counting the number of times event $i$ was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document. The likelihood of observing a histogram $x$ under class $C_k$, whose multinomial has parameters $p_{k1}, \dots, p_{kn}$, is given by

$$p(x \mid C_k) = \frac{\left(\sum_{i=1}^{n} x_i\right)!}{\prod_{i=1}^{n} x_i!} \prod_{i=1}^{n} p_{ki}^{x_i}$$

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a pseudocount, in all probability estimates such that no probability is ever set to be exactly zero. This way of regularizing Naïve Bayes is called Laplace smoothing. Implementing this algorithm on the Amazon dataset yields an accuracy of 84.48% and an f1-score of 0.85.

III) Random Forest Classifier
A random forest is an ensemble of decision trees, each constructed from class-labelled training tuples; the forest combines the predictions of its trees. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node. Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3) and Chi-squared Automatic Interaction Detector (CHAID) are a few types of decision tree learning algorithms. When trained on the Amazon Reviews dataset, this algorithm achieves an accuracy of 84.60% and an f1-score of 0.85.

IV) Convolutional Neural Network
A convolutional neural network consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, ReLU layers (i.e. the activation function), pooling layers, fully connected layers and normalization layers.

Describing the process as a convolution in neural networks is a matter of convention. Mathematically it is a cross-correlation rather than a convolution. This only has significance for the indices in the matrix, and thus for which weights are placed at which index. Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

As expected, the CNN model yielded an accuracy of 94.40%.

V) Long Short-Term Memory RNN
For the neural network approach, LSTM RNNs have been used because they generally perform better than traditional RNNs. A problem arises when using traditional RNNs for NLP tasks because the gradients from the objective function can vanish or explode after a few iterations of multiplying the weights of the network. For such reasons, simple RNNs have rarely been used for NLP tasks such as text classification. In such a scenario, one can turn to another model in the RNN family such as the LSTM model. LSTMs are better suited to this task due to the presence of input gates, forget gates, and output gates, which control the flow of information through the network.

An accuracy of 94.56% was obtained using this algorithm on the Amazon Reviews dataset.

4. RESULTS
For the recurrent approach, LSTM RNNs are used because they generally perform better than traditional RNNs at learning relationships: the gradients of a simple RNN can vanish or explode after a few iterations, which is why simple RNNs have rarely been used for NLP tasks such as text classification [7], whereas the input, forget and output gates of an LSTM control the flow of information through the network.

Table 1. Results

Algorithm Used                   Accuracy
Bernoulli Naïve-Bayes            82.75%
Multinomial Naïve-Bayes          84.48%
Random Forest                    84.60%
Convolutional Neural Network     94.40%
Recurrent Neural Network         94.56%

Thus, from the analysis it is observed that the Recurrent Neural Network performs the best.

5. CONCLUSION
Thus, to develop the recommendation list, various machine learning and deep learning algorithms have been discussed and used to analyze the reviews of the Amazon Reviews dataset. As can be seen from the evidence above, the Recurrent Neural Network yields the highest accuracy, 94.56%. Thus, in this experiment a deep learning algorithm outperforms the machine learning algorithms and is consequently chosen to classify the user reviews. The proposed system will take the output of this analysis and map it to the user’s interests.

In the proposed system, each review is looked at holistically. Breaking a review down based on multiple core properties may result in a more in-depth and accurate classification. For instance, in a review about a tourist spot, extracting features like parking availability, cleanliness and child safety may prove helpful and needs further exploration in the future.

6. ACKNOWLEDGEMENTS
We gratefully acknowledge Amazon Co. and Kaggle for making the dataset of the Amazon Product reviews openly available.
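The scoring rule of Section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Place`, `score` and `recommend`, the snake_case feature keys, and the assumption that user-supplied values lie between 0 and 10 are all hypothetical choices made for illustration, and the two places in the usage example are invented.

```python
from dataclasses import dataclass

# Features most users are assumed to want: fixed full weight of 10 (Section 3.2).
FIXED_FEATURES = ("ambience", "cleanliness", "must_visit", "ratings")
# Features weighted by the user's own input (assumed here to lie in 0..10).
USER_FEATURES = ("nightlife", "parking", "peacefulness", "child_safety")
FULL_WEIGHT = 10


@dataclass(frozen=True)
class Place:
    name: str
    features: dict  # feature name -> value between 0 and 5


def score(place, user_values):
    """Weighted sum of the eight feature values of one place."""
    fixed = sum(FULL_WEIGHT * place.features[f] for f in FIXED_FEATURES)
    personal = sum(place.features[f] * user_values[f] for f in USER_FEATURES)
    return fixed + personal


def recommend(places, user_values):
    """Places sorted by descending score, i.e. the recommendation list."""
    return sorted(places, key=lambda p: score(p, user_values), reverse=True)


if __name__ == "__main__":
    # Invented example data, not taken from the paper's 26 places in Goa.
    quiet = Place("quiet beach", dict(ambience=4, cleanliness=4, must_visit=3,
                                      ratings=4, nightlife=1, parking=3,
                                      peacefulness=5, child_safety=5))
    party = Place("party street", dict(ambience=4, cleanliness=3, must_visit=4,
                                       ratings=4, nightlife=5, parking=2,
                                       peacefulness=1, child_safety=2))
    family = dict(nightlife=0, parking=5, peacefulness=10, child_safety=10)
    for place in recommend([quiet, party], family):
        print(place.name, score(place, family))
```

With every feature at 5 and every user value at 10, `score` returns 400, which agrees with the maximum stated in Section 3.2. Keeping the fixed and user-weighted features in separate tuples makes it easy to move a feature from one group to the other.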

7. REFERENCES
[1] Staab, S., Werthner, H., Ricci, F., Zipf, A., Gretzel, U., Fesenmaier, D. R., ... & Knoblock, C. (2002). Intelligent systems for tourism. IEEE Intelligent Systems, (6), 53-64.

[2] Debnath, S., Ganguly, N., & Mitra, P. (2008, April). Feature weighting in content-based recommendation system using social network analysis. In Proceedings of the 17th international conference on World Wide Web (pp. 1041-1042). ACM.

[3] Cantador, I., Bellogín, A., & Vallet, D. (2010, September). Content-based recommendation in social tagging systems. In Proceedings of the fourth ACM conference on Recommender systems (pp. 237-240). ACM.

[4] Ghose, A., Ipeirotis, P. G., & Li, B. (2012). Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content. Marketing Science, 31(3), 493-520.

[5] McCallum, A., & Nigam, K. (1998, July). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, No. 1, pp. 41-48).

[6] Berka, T., & Plößnig, M. (2011). Designing recommender systems for tourism. Proceedings of ENTER 2011, 26-28.

[7] Kibriya, A. M., Frank, E., Pfahringer, B., & Holmes, G. (2004, December). Multinomial naïve bayes for text categorization revisited. In Australasian Joint Conference on Artificial Intelligence (pp. 488-499). Springer, Berlin, Heidelberg.

[8] Kanakaraj, M., & Guddeti, R. M. R. (2015, February). Performance analysis of Ensemble methods on Twitter sentiment analysis using NLP techniques. In Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015) (pp. 169-170). IEEE.

[9] Barry, J. (2017). Sentiment Analysis of Online Reviews Using Bag-of-Words and LSTM Approaches. In AICS (pp. 272-274).
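As a supplementary illustration of the Multinomial Naïve-Bayes model and the Laplace smoothing described in Section 3.4, the following is a minimal pure-Python sketch. It is not the code used in the paper: the function names `train` and `predict`, the pseudocount parameter `alpha`, and the four toy reviews are assumptions made for illustration. The multinomial coefficient is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """docs: list of (tokens, label) pairs. Returns (log_prior, log_like, vocab)."""
    vocab = {tok for tokens, _ in docs for tok in tokens}
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Laplace smoothing: add the pseudocount alpha to every word count,
        # so that no word probability is ever exactly zero.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior for the given tokens."""
    log_prior, log_like, vocab = model
    counts = Counter(t for t in tokens if t in vocab)  # skip unseen words
    scores = {c: log_prior[c] + sum(n * log_like[c][w] for w, n in counts.items())
              for c in log_prior}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Invented toy reviews, not drawn from the Amazon or tourism datasets.
    reviews = [
        ("clean beach great view".split(), "pos"),
        ("great food friendly staff".split(), "pos"),
        ("dirty beach rude staff".split(), "neg"),
        ("terrible food dirty rooms".split(), "neg"),
    ]
    model = train(reviews)
    print(predict(model, "great clean rooms".split()))
    print(predict(model, "dirty rude staff".split()))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long reviews.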

IJCATM: www.ijcaonline.org