Location Prediction On Twitter Using Machine Learning Techniques
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Predicting the location of users from online social media has attracted considerable research in recent years. Automatic recognition of locations related to or referenced in documents has been investigated for decades. As one of the leading online social networks, Twitter has attracted a large number of users who send millions of tweets every day. Because of the worldwide coverage of its users and the continuous stream of tweets, location prediction on Twitter has gained significant attention. Tweets are short, noisy and rich in content, which brings many challenges for researchers. In the proposed framework, a general picture of location prediction using tweets is studied. In particular, tweet location is predicted from tweet contents. By outlining tweet content and contexts, we highlight how the problem depends on these text inputs. In this work, we predict the location of a user from the tweet text using the machine learning techniques Naïve Bayes, Support Vector Machine and Decision Tree.
Key Words: Social media, Twitter, Tweets, location prediction, Naive Bayes, Support Vector Machine, Decision Tree, Machine Learning
1. INTRODUCTION
Users may explicitly state their location in the tweet text they post, whereas in other cases the location is available only implicitly through related cues. Tweets are not written in a formal language; users post casually and include emoticons. Abbreviations, misspellings and extra characters in emotional words make tweet texts noisy. Techniques designed for ordinary documents are not well suited to analysing tweets. The length limit of about 140 characters can make a tweet hard to understand if its context is not considered.
The problem of location prediction, also known as geolocation prediction, has been examined for Wikipedia and web page documents. Entity recognition in such formal documents has been researched for years, and different ways of handling their content and context have been studied extensively. However, location prediction on Twitter depends highly on tweet content. Users living in specific regions or locations may discuss local tourist spots, landmarks, buildings and related events.
Home Location:
The user's residential address, or the location given by the user at account creation, is considered the home location. Home location prediction can be used in various applications such as recommendation systems, location-based advertisements, health monitoring and polling. A home location can be specified as an administrative location, a geographical location or coordinates.
Tweet Location:
Tweet location refers to the place from which a tweet is posted. By inferring tweet locations, one can obtain a picture of the user's mobility. Home location is usually collected from the user profile, whereas tweet location can be obtained from the user's geo-tag. Given this perspective on tweet location, points of interest (POIs) are widely adopted as the representation of tweet locations.
Mentioned Location:
When composing tweets, users may mention the names of locations in the tweet text. Predicting mentioned locations can support a better understanding of tweet content and benefit applications such as recommendation systems, location-based advertisements, health monitoring and polling. In this study, we include two sub-modules for mentioned location:
The first is recognizing the mentioned location in the tweet text, which is achieved by extracting the text in a tweet that refers to geographic names. The second is identifying the location by resolving these names to entries in a geographical database.
2. RELATED WORK
Researchers have studied many techniques for location prediction from tweet content and social media content; a few of them are discussed below.
In [1], the authors address the problem of finding location from social media content. Motivated by term frequency (TF) and inverse document frequency (IDF), the authors of [1] and [2] derived Inverse City Frequency (ICF) and Inverse Location Frequency (ILF), respectively. They ranked features first by these frequency values and then by TF values. From this they concluded that local words occur in only a few places and have high ICF and ILF values.
Han et al. [3] proposed a model for identifying local words that are indicative of, or used only in, certain locations. They aimed to identify such words automatically by ranking them by location and measuring their degree of association with particular locations or cities.
Li et al. [4] proposed the multiple location profiling (MLP) model to determine user location accurately by computing probabilities based on the Bernoulli distribution. Their work shows that a user's home location can be predicted accurately using this model. The authors used a multinomial distribution to estimate the probability of a tweet against the venue names of each location.
Mahmud et al. [5] proposed a classification model for predicting location and improved prediction accuracy by first predicting the region and then the city. They detected the movement of users using classifier models; users who travel for a certain period are identified in order to improve prediction accuracy. The authors considered a person to be travelling when the distance between the locations of two tweets is more than 100 miles.
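The 100-mile travel criterion described above can be illustrated with a short sketch. It is not the implementation used in [5]: the great-circle (haversine) distance, the function names `haversine_miles` and `is_travelling`, and the approximate city coordinates are assumptions made only for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, used by the haversine formula


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_travelling(tweet_a, tweet_b, threshold_miles=100.0):
    """Hypothetical helper: flag travel when two geo-tagged tweets are far apart."""
    return haversine_miles(*tweet_a, *tweet_b) > threshold_miles


# Approximate coordinates, assumed for illustration only.
chennai = (13.0827, 80.2707)
mumbai = (19.0760, 72.8777)
near_chennai = (13.05, 80.25)

print(is_travelling(chennai, mumbai))        # tweets from two distant cities
print(is_travelling(chennai, near_chennai))  # tweets from within the same city
```

The threshold is a parameter so that other distance cut-offs can be tried without changing the distance computation.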
Most of the techniques used in existing work are based on machine learning, although a few deep learning approaches have also been proposed. Miura et al. [6] implemented a neural network for Twitter location prediction. The authors classified tweets or users using neural networks, integrating metadata with tweet texts to train the model. Their model achieved an accuracy of around 41 percent.
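To make the Inverse City Frequency idea from this section more concrete, the sketch below computes an ICF score per word on a toy corpus. It assumes ICF is defined by direct analogy with IDF, as the logarithm of the number of cities divided by the number of cities in which the word occurs; the cited papers may use a different exact formula, and the function name and toy tweets are hypothetical.

```python
import math
from collections import defaultdict


def inverse_city_frequency(tweets_by_city):
    """Return {word: ICF}, with ICF = log(total cities / cities using the word)."""
    cities_using = defaultdict(set)
    for city, tweets in tweets_by_city.items():
        for text in tweets:
            for word in text.lower().split():
                cities_using[word].add(city)
    n_cities = len(tweets_by_city)
    return {word: math.log(n_cities / len(cities))
            for word, cities in cities_using.items()}


# Toy corpus, invented for illustration.
toy = {
    "Chennai": ["marina beach is crowded today", "traffic near marina"],
    "Mumbai": ["traffic near gateway today"],
    "Kerala": ["houseboat ride today", "heavy traffic again"],
}

icf = inverse_city_frequency(toy)
# Words used in a single city score highest; words used everywhere score zero.
ranked = sorted(icf, key=lambda w: (-icf[w], w))
print(ranked[:5])
```

Under this assumed definition, a word such as "marina" that appears in only one city receives a high score, while "today" and "traffic", which appear in every city, receive a score of zero, which matches the observation that local words have high ICF values.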
3. PROPOSED WORK
A live stream of Twitter data is collected as the dataset using authentication keys. The aim of the proposed system is to predict the user location from Twitter content, considering the user's home location, tweet location and tweet content. To do this we used three machine learning approaches and identified the best model among them. Figure 1 represents the overall architecture of the proposed system with its methodology modules.
Figure 1: System Architecture of Proposed System
DATASET DETAILS
A live tweet stream for the keyword “apple” is collected from Twitter and stored in the file 'twitter.json'. Live Twitter data can be collected by registering a consumer_key, consumer_secret, access_token and access_token_secret for authentication and then collecting the live stream of tweets. We collected more than 1000 tweets for particular keywords such as ‘Chennai’, ‘Mumbai’ and ‘Kerala’. The information extracted from the live stream includes tweetid, name, screen_name, tweet_text, HomeLocation, TweetLocation, MentionedLocation and Lvalue.
Primary analysis was a basic processing of the text of the tweets. This was done by merging the collected tweets for a given user into a single “document” and analysing that.
Figure 2: Extract Dataset from live twitter for locations Chennai, Mumbai, Kerala
The implementation of the proposed work is split into the following modules: data collection and extraction, data pre-processing, applying machine learning techniques, and comparing them.
Data Collection and extraction
The live tweet stream is collected and stored in 'twitter.json' as described under Dataset Details. Data from the 'twitter.json' file is read, and the fields tweetid, name, screen_name, tweet_text, HomeLocation, TweetLocation and MentionedLocation are extracted. The tweet text is compared against the Natural Language Toolkit (NLTK) package available in Python, and the data extracted from the JSON file is written to CSV.
Data Pre-processing
Data pre-processing includes the following steps:
1. Remove extra characters from the tweet text.
2. Capitalize all words to help find geo-locations.
3. Remove the tweet if the user's home location is not mentioned.
4. Use the home location as the tweet location if the user's tweet location is null.
5. Remove the tweet if no location is mentioned in the tweet text.
Finally, geodata is extracted from the tweet text. The last step is to assign an integer value to each location, for example Chennai = 1, Mumbai = 2, Kerala = 3. Lcoder is used to assign the location as an integer value.
The work is implemented in Python; the libraries used include scikit-learn, numpy, pandas, matplotlib and geography.
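The pre-processing steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the code used in this work: it assumes each tweet is a flat dictionary with the field names given in the text, treats "extra characters" as URLs, @/# markers and anything that is not a letter, and uses a fixed lookup table in place of Lcoder. The names `clean_text`, `preprocess` and `KNOWN_LOCATIONS` are hypothetical.

```python
import re

# Integer codes for the three locations, as given in the text.
KNOWN_LOCATIONS = {"Chennai": 1, "Mumbai": 2, "Kerala": 3}


def clean_text(text):
    """Steps 1 and 2: strip extra characters, then capitalize every word."""
    text = re.sub(r"http\S+|[@#]", " ", text)  # drop URLs and @/# markers
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop digits, punctuation, emoji
    return " ".join(word.capitalize() for word in text.split())


def preprocess(records):
    """Apply steps 1-5 and attach an integer location value (Lvalue)."""
    rows = []
    for rec in records:
        home = rec.get("HomeLocation")
        if not home:                                  # step 3: no home location
            continue
        text = clean_text(rec.get("tweet_text", ""))
        tweet_loc = rec.get("TweetLocation") or home  # step 4: fall back to home
        mentioned = [w for w in text.split() if w in KNOWN_LOCATIONS]
        if not mentioned:                             # step 5: no location in text
            continue
        rows.append({
            "tweetid": rec["tweetid"],
            "tweet_text": text,
            "HomeLocation": home,
            "TweetLocation": tweet_loc,
            "MentionedLocation": mentioned[0],
            "Lvalue": KNOWN_LOCATIONS[mentioned[0]],
        })
    return rows


# Invented sample records for illustration.
sample = [
    {"tweetid": 1, "tweet_text": "Loving the rain in chennai!!! #monsoon",
     "HomeLocation": "Chennai", "TweetLocation": None},
    {"tweetid": 2, "tweet_text": "MUMBAI local trains are packed :(",
     "HomeLocation": "", "TweetLocation": "Mumbai"},
    {"tweetid": 3, "tweet_text": "Nothing much happening today",
     "HomeLocation": "Kerala", "TweetLocation": "Kerala"},
    {"tweetid": 4, "tweet_text": "Backwaters of kerala http://example.com",
     "HomeLocation": "Mumbai", "TweetLocation": "Kerala"},
]

rows = preprocess(sample)
for row in rows:
    print(row)
```

In this sample, tweet 2 is dropped because it has no home location and tweet 3 because its text mentions no known location; tweet 1 inherits its home location as the tweet location.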
Naïve Bayes classification
The Naïve Bayes classifier is one of the most popular and simple classifier models. It computes the posterior probability of each class from the distribution of words in the document. The Naïve Bayes classifier works with the Bag of Words (BOW) feature extraction model, which does not consider the position of a word inside the document. The model uses Bayes' theorem to predict a label from the given feature set. The dataset is split into a training set and a test set, and NB_model is applied to the test set to predict the location.
Support Vector machine
The support vector machine is one of the most commonly used supervised learning techniques and can be applied to both classification and regression problems. The algorithm plots each data item as a point in n-dimensional space, where the value of each feature is the value of the corresponding coordinate; classification is then performed by finding the hyperplane that best separates the classes.
Decision Tree
A decision tree is a learning model used for classification problems. It works by splitting the dataset into at least two subsets. The internal nodes of a decision tree represent tests on features, the branches represent the outcomes of those tests, and the leaves are the decisions reached after training.
Decision Tree works as follows
1. The decision tree starts with all training instances linked to the root node.
2. It splits the dataset into a train set and a test set.
3. It uses information gain to choose the attribute that labels each node. The resulting subsets contain instances with a similar value for that attribute.
4. The above process is repeated on every subset until leaves are generated in the tree.
The tree is constructed in such a way that no root-to-leaf path contains the same attribute twice. This is done recursively to construct every subtree on the training instances that are classified down that path in the tree. For every record in the dataset, class label prediction starts at the root of the tree. The root attribute is checked against the given record, and the branch matching the record's value is followed to the next node, whose attribute is then checked. This process continues until a leaf is reached. The sample decision tree applied is depicted in Figure 3.
Figure 3: Decision Tree model
The implementation follows the use case diagram given in Figure 4.
The features extracted from the tweet are shown in the code snippet below.

    (user["features"]["id"], user["features"]["name"],
     user["features"]["screen_name"], user["features"]["tweets text"],
     user["features"]["Home Location"], user["features"]["Tweet Location"])

Instead of attaching geo-tags to tweets, users may sometimes reveal the relevant location by mentioning its name or landmarks in the tweets. Location names are important during pre-processing, so we capitalize every word of the tweet text to identify geo-locations. Geo-locations can be processed in two ways. The first is recognition: the text is labelled and, if recognized, converted to a location. The second is disambiguation, which resolves the entries to identified locations.
Figure 4: Use case diagram of Location prediction
The following table shows the results predicted by the proposed machine learning algorithms. The values are numeric codes for the locations: Chennai=1, Mumbai=2 and Kerala=3.

ID    Decision Tree    SVM    Naive Bayes
1     1                1      1
2     2                2      1
3     0                0      0
4     2                2      2
5     1                1      1
6     0                0      0
7     0                0      0
8     2                2      2
9     1                1      1
10    1                1      2

Table 1: Predicted results using different machine learning algorithms
4. RESULTS AND DISCUSSIONS
The pre-processed dataset is used for the machine learning process, in which we applied the Naïve Bayes, SVM and Decision Tree algorithms. The dataset is split into 80% for training and 20% for testing; we predicted the location and compared the accuracy in the chart in Figure 5.
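The 80/20 split and the comparison of the three classifiers described in this section can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions, not the code that produced the reported results: the toy tweets are invented, bag-of-words counts from `CountVectorizer` stand in for the feature extraction, a linear-kernel `SVC` is assumed for the SVM, and the error measures are computed on the integer location codes.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the pre-processed tweets (hypothetical data).
CITY_CODES = {"Chennai": 1, "Mumbai": 2, "Kerala": 3}
TEMPLATES = [
    "Heavy Rain In {} Today", "Traffic Jam Near {} Station",
    "Best Street Food In {}", "Weekend Trip To {} Planned",
    "Power Cut Again In {}", "Cricket Match At {} Stadium",
    "Flight To {} Delayed", "Sunset View From {} Beach",
    "New Cafe Opened In {}", "Missing The Monsoon In {}",
]
texts = [t.format(city) for city in CITY_CODES for t in TEMPLATES]
labels = [code for code in CITY_CODES.values() for _ in TEMPLATES]

# 80% training, 20% test, keeping the class proportions in both parts.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

# Bag-of-words features: fit the vocabulary on the training tweets only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

Fitting the vectorizer on the training split only keeps test-set vocabulary out of the model, so the reported scores reflect unseen tweets.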
The following table shows the performance evaluation of the three machine learning algorithms, namely Naive Bayes, Support Vector Machine (SVM) and Decision Tree. The evaluation parameter shown in the table is the accuracy of prediction. The table shows that the decision tree outperforms the other algorithms in terms of accuracy.

Algorithm        Accuracy
Naive Bayes      43.67
SVM              86.78
Decision Tree    99.96

Table 2: Accuracy comparison of machine learning algorithms
The following table shows the error rates in prediction. Four measures are calculated: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE) and R-squared.

Error Types    Naive Bayes    SVM     Decision Tree
MAE            1.06           0.13    0.02
MSE            2.31           0.13    0.02
RMSE           1.52           0.36    0.04
R-Squared      0.01           0.88    1.00

Table 3: Error rates in prediction
Figure 5 shows the experimental results achieved using the three machine learning algorithms. Naive Bayes achieves around 40% accuracy, the SVM algorithm around 85% and the Decision Tree around 99%. From this work we can therefore conclude that the Decision Tree is the most suitable of the three algorithms for the location prediction problem in tweet texts.
5. CONCLUSIONS
Three locations are considered from Twitter data, namely home location, mentioned location and tweet location. With Twitter data, geolocation prediction becomes a challenging problem: the nature of tweet text and the character limit make tweets hard to understand and analyze. In this work, we predicted the geolocation of users from their tweet text using machine learning algorithms. We implemented three algorithms to identify the best-performing one for the geolocation prediction problem. Our experimental analysis concluded that the decision tree is suitable for tweet text analysis and the location prediction problem.
REFERENCES
[1] Han, Bo & Cook, Paul & Baldwin, Timothy. (2012). Geolocation Prediction in Social Media Data by Finding Location Indicative Words. 24th International Conference on Computational Linguistics - Proceedings of COLING 2012: Technical Papers. 1045-1062.
[2] Ren K., Zhang S., Lin H. (2012) Where Are You Settling Down: Geo-locating Twitter Users Based on Tweets and Social Networks. In: Hou Y., Nie JY., Sun L., Wang B., Zhang P. (eds) Information Retrieval Technology. AIRS 2012. Lecture Notes in Computer Science, vol 7675. Springer, Berlin, Heidelberg.
[3] Han, Bo & Cook, Paul & Baldwin, Timothy. (2014). Text-Based Twitter User Geolocation Prediction. The Journal of Artificial Intelligence Research (JAIR). 49. 10.1613/jair.4200.
[4] Li, Rui & Wang, Shengjie & Chen-Chuan Chang, Kevin. (2012). Multiple Location Profiling for Users and Relationships from Social Network and Content. Proceedings of the VLDB Endowment. 5. 10.14778/2350229.2350273.
[5] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews. 2014. Home Location Identification of Twitter Users. ACM Trans. Intell. Syst. Technol. 5, 3, Article 47 (July 2014), 21 pages. DOI: http://dx.doi.org/10.1145/2528548
[6] Miura, Yasuhide, Motoki Taniguchi, Tomoki Taniguchi and Tomoko Ohkuma. “A Simple Scalable Neural Networks based Model for Geolocation Prediction in Twitter.” NUT@COLING (2016).