


International Journal of Scientific and Research Publications, Volume 8, Issue 10, October 2018
ISSN 2250-3153

Twitter Sentiment Analysis Using Support Vector Machine and K-NN Classifiers

Naw Naw

DOI: 10.29322/IJSRP.8.10.2018.p8252
http://dx.doi.org/10.29322/IJSRP.8.10.2018.p8252

Citation: Naw, Naw. (2018). Twitter Sentiment Analysis Using Support Vector Machine and K-NN Classifiers. International Journal of Scientific and Research Publications (IJSRP), 8(10). 10.29322/IJSRP.8.10.2018.p8252.

Abstract— At the present time, social media has become an important and popular communication medium among online surfers. Twitter is one of the most popular social networking services, where thoughts and opinions about various aspects and activities can be shared by millions of users. Social media websites are rich sources of data for opinion mining, and such data can be applied to sentiment analysis. Sentiment analysis is the study of human behavior by extracting user opinion and emotion from plain text. Among machine learning techniques, the Support Vector Machine (S.V.M) classifier and the K-Nearest Neighbour (K-N.N) classifier are used in this system. The system provides the analytical results of education, business, crime and health for the needs of Educational Authorities, Economists, Government Organizations and Health Organizations. The system then predicts the conditions of selected ASEAN countries (Malaysia, Singapore, Vietnam and Myanmar) according to the tweets. In this system, accuracy, precision, recall and f1-score are also compared for these two classifiers.

Index Terms—K-NN, Opinion Mining, Sentiment Analysis, SVM, Twitter.

1) INTRODUCTION
Nowadays, social media has a very large effect on digital improvement in terms of global communications. The emergence of social media has provided a place for web users to share their thoughts and express their opinions on different topics and events. Micro blogging has become a very popular communication tool among Internet users. Millions of messages appear daily on popular web sites that provide micro blogging services, such as Facebook, Twitter, Tumblr and so on. Twitter is an online social networking medium, popular since 2006, where registered users share or post messages of under 140 characters known as tweets.

Sentiment analysis or opinion mining is the computational study of the opinions, attitudes and emotions of an entity. The entity may describe an individual, an event or a topic; the topic is likely to be a review. The system is developed to analyze the Educational Rate, Business Rate, Health Rate and Crime Rate occurring in Malaysia, Singapore, Vietnam and our country, Myanmar, according to the tweets. As a first step, the system crawls real-time data as input from Twitter. As a second step, sentiment analysis is implemented using the crawled data. The system needs to build a classifier model (Support Vector Machine and K-Nearest Neighbour) to implement sentiment analysis. Finally, the system outputs the percentage scores of these sectors and displays them according to their scores using visualization techniques. The performance of classification is also analyzed using accuracy, precision, recall and f1-score.

2) RELATED WORK
AaratiPatil and SrinivasaNarasimhaKini [1] proposed Evolution of Social Media from the era of Information Retrieval, which provides insightful information showing that data scientists and social users are taking interest in sentiment analysis of social media data. Different approaches have been implemented to automatically detect sentiment in texts [2]. Active research on sentiment analysis of selected micro blogging sites is explored in [3].

Barbosa and Feng [4] presented robust sentiment detection on Twitter from biased and noisy data. They classified the subjectivity of social media messages based on traditional features with the inclusion of some social media site specific clues such as retweets, hash tags, links, uppercase words, emoticons, and exclamation and question marks. Further, Agarwal, Xie, Vovsha, Rambow and Passonneau introduced Part-Of-Speech (P.O.S) specific prior polarity features and a tree kernel to obviate the need for tedious feature engineering.

Agarwal et al. [5] approached the task of mining sentiment from Twitter as a 3-way task of classifying sentiment into positive, negative and neutral classes. They experimented with three types of models: a unigram model, a feature based model and a tree kernel based model. For the tree kernel based model they designed a new tree representation for tweets. The feature based model uses 100 features and the unigram model uses over 10,000 features; features combining the polarity of words with their parts-of-speech tags are most important for the classification task. The tree kernel based model outperformed the other two.

3) SYSTEM DESIGN OVERVIEW
The system design is mainly composed of two parts: Training and Testing. In the training phase, the system trains the classifier model on the input raw dataset of tweets. In the testing phase, the system crawls real tweets of Education, Business, Crime and Health data and then classifies them into the positive, negative or neutral class based on the Classifier Model. The system design has four main components: preprocessing, feature selection, feature extraction and classification. The overall system design is illustrated in Fig. 1.

Firstly, tweets about Education, Business, Crime and Health are extracted from Twitter; the language is English, using the Twitter Streaming API. The system then performs the preprocessing step, which consists of transformation, negation handling, tokenization, filtering and normalization. Features are then selected by comparing with the Knowledge Base. After that, meaningful features are extracted using Term Frequency-Inverse Document Frequency (T.F-I.D.F), and these features are selected as the input features of classification. At last, the system builds the Classifier Model (Support Vector Machine and K-Nearest Neighbour) in order to perform the training process.
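The train/classify flow described above can be sketched with scikit-learn, which the paper's performance section mentions using. This is a minimal illustration, not the author's implementation: the toy tweets, labels and parameter choices are invented for the sketch.

```python
# Minimal sketch of the training/testing flow: TF-IDF features
# fed to an SVM and a K-NN classifier (toy data, invented labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

train_tweets = [
    "school education is improving",      # positive
    "business profits fall sharply",      # negative
    "new hospital opened today",          # positive
    "crime rate is rising in the city",   # negative
    "the meeting is on tuesday",          # neutral
    "classes resume next week",           # neutral
]
train_labels = ["positive", "negative", "positive",
                "negative", "neutral", "neutral"]

# Feature extraction: TF-IDF weighted word frequencies.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_tweets)

# Classifier models: SVM (one-vs-one) and K-NN.
svm = SVC(kernel="linear", decision_function_shape="ovo").fit(X_train, train_labels)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, train_labels)

# Testing phase: unseen tweets go through the same vectorizer.
X_test = vectorizer.transform(["education is improving in school"])
print(svm.predict(X_test)[0], knn.predict(X_test)[0])
```

A real run would substitute the crawled, preprocessed tweet corpus for the toy lists.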
[Fig. 1, not reproduced here, shows the pipeline: Training Data and Testing Data each pass through Transformation, Negation Handling, Tokenization, Filtering and Normalization; features are selected against the Knowledge Base and extracted using TF-IDF, then fed to the Classifier / Classifier Model, which outputs a Positive, Neutral or Negative class.]
Fig. 1. The System Design

In the Testing phase, real data about Education, Business, Crime and Health is crawled from Twitter as the input of the system. Then Preprocessing, Feature Selection and Feature Extraction are performed as in the Training phase. The output features are used to classify the tweet into the positive, negative or neutral class based on the Classifier Model. The system then displays the sectors according to their scores using visualization techniques. The system also compares the performance of the two classifiers in accuracy, precision, recall and f1-score.

1) Preprocessing
Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. If there is much irrelevant and redundant information or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preprocessing is the most important phase of a machine learning project, especially in computational biology [6]. A sample tweet about education is shown in Fig. 2.

[Fig. 2, a screenshot of sample tweets about education, is not reproduced here.]
Fig. 2. Sample Tweets about Education

a) Transformation
The following steps are performed in the Transformation step. In general, a clean tweet should not contain URLs, hashtags (i.e. #studying) or mentions (i.e. @Irene).

Firstly, the tweets extracted from Twitter are converted from upper case to lower case; URLs are replaced with the generic word URL; @username is replaced with the generic word AT_USER; #hashtag is replaced with the exact same word without the hash; punctuation at the start and end of the tweets is removed; and multiple whitespaces are replaced with a single whitespace. The resultant tweets from the Transformation step are shown in Table 1.

Table 1. Tweets after Performing Transformation Step
Step: Transformation
Tweet: AT_USER AT_USER want the rag picker children to get into school trust me bhowapur govt school is not worth going plz do something

b) Negation Handling
Negations are words which affect the sentiment orientation of other words in a sentence. Examples of negation words include not, no, never, cannot, should not, would not, etc. Negation handling is an automatic way of determining the scope of negation and inverting the polarities of opinionated words that are actually affected by a negation [7]. The resultant tweets after performing the negation handling step are shown in Table 2.

Table 2. Tweets after Performing Negation Handling Step
Step: Negation Handling
Tweet: AT_USER AT_USER want the rag picker children to get into school trust me bhowapur govt school is not not_worth not_going not_plz not_do not_something

c) Tokenization
Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens [8].

The system tokenizes the uniform sentence from the Negation Handling step into smaller components (unigrams) as shown in Table 3. These resultant words become the input for the next preprocessing step.

Table 3. Features after Performing Tokenization Step
Step: Tokenization
Features: want, the, rag, picker, children, to, get, into, school, trust, me, bhowapur, govt, school, is, not, not_worth, not_going, not_plz, not_do, not_something
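The transformation, negation handling and tokenization steps above can be sketched in plain Python. This is a best-guess reconstruction, not the paper's code: the raw tweet is inferred from Tables 1-3 (Fig. 2 is not reproduced), and the negation rule here naively prefixes every word after a negation word to the end of the tweet, which happens to reproduce Table 2.

```python
# Sketch of preprocessing steps a)-c); the raw tweet and the exact
# rules are assumptions inferred from Tables 1-3.
import re

raw = ("@HRDMinistry @PMOIndia want the rag picker children to get "
       "into school. trust me bhowapur govt school is not worth going "
       "plz do something")

# a) Transformation: lower-case, replace URLs/mentions, drop the
#    '#' from hashtags, strip edge punctuation, collapse whitespace.
t = raw.lower()
t = re.sub(r"https?://\S+|www\.\S+", "URL", t)
t = re.sub(r"@\w+", "AT_USER", t)
t = re.sub(r"#(\w+)", r"\1", t)
t = " ".join(w.strip(".,!?") for w in t.split())

# b) Negation handling: once a negation word is seen, prefix the
#    following words with "not_" (naively, to the end of the tweet).
words, negated, out = t.split(), False, []
for w in words:
    out.append("not_" + w if negated and not w.startswith("not") else w)
    if w in {"not", "no", "never", "cannot"}:
        negated = True

# c) Tokenization: `out` holds the whitespace-split unigrams.
print(out)
```

Run on the sample tweet, `out` matches the feature list of Table 3, including the not_ prefixed tail of Table 2.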
d) Filtering
Stop words are words which are filtered out before or after processing of natural language data (text). A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query [9]. The resultant features from the Filtering step are shown in Table 4.

Table 4. Features after Performing Filtering Step
Step: Filtering
Features: want, rag, picker, children, get, school, trust, bhowapur, govt, school, not, not_worth, not_going, not_plz, not_do, not_something

e) Normalization
In the Normalization step, lemmatization is performed. Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form [10].

After the normalization step is performed, the root words are obtained as shown in Table 5, and they are used for the feature extraction step.

Table 5. Features after Performing Normalization Step
Step: Normalization
Features: hrdministry, pmoindia, want, rag, picker, children, get, school, trust, bhowapur, govt, school, not, not_worth, not_go, not_plz, not_do, not_something

2) Feature Selection
After the preprocessing step, features are selected by comparing with the Knowledge Base to improve accuracy. Several hundred words related to education, business and crime features are collected and added to the Knowledge Base. Finally, the system selects features from the Knowledge Base. In this way, the system obtains the essential features, as shown in Table 6, and achieves the best accuracy.

Table 6. Features after Performing Feature Selection Step
Step: Feature Selection
Features: Want, school, trust, not, not_worth, not_go

3) Feature Extraction
The system trains the Support Vector Machine and K-NN classifiers based on T.F-I.D.F (Term Frequency-Inverse Document Frequency) weighted word frequency. TF is how frequently a word occurs in a document. IDF decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term's tf-idf, the frequency of a term adjusted for how rarely it is used. When feature extraction is performed using T.F-I.D.F, the system selects features as the input features of classification.

4) Classification
Text classification models are used to categorize text into organized groups. Text is analyzed by a model and then the appropriate tags are applied based on the content. Machine learning models that can automatically apply tags for classification are known as classifiers.

Classifiers can't just work automatically; they need to be trained to be able to make specific predictions for texts. Once enough texts have been trained on, the classifier can learn from those associations and begin to make predictions on new texts.

There are two main approaches to sentiment classification: lexicon-based and machine-learning. A lexicon-based approach tokenizes data into individual words, which are checked against a sentiment lexicon containing a polarity value for individual words. The sum of the polarities is passed to an algorithm that determines the overall polarity of the sentence. A machine-learning approach utilizes a labeled training set to adapt a classifier to the data domain of the training set. The trained classifier can then predict the result of the problem, and the success rate of the prediction depends on how well the problem is contained within the same domain.

There are three types of machine learning algorithms: supervised learning, unsupervised learning and reinforcement learning. Among them, this system uses the supervised learning approach. This approach consists of a target or outcome variable which is to be predicted from a given set of predictors. Using these sets of variables, a function that maps inputs to desired outputs is generated. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of supervised learning are decision trees, K-Nearest Neighbour (K-N.N), Support Vector Machine (S.V.M), Naïve Bayes (N.B), Maximum Entropy (MaxEnt) and so on. In this system, the S.V.M and K-N.N classifiers are used.

a) Support Vector Machine (One-Versus-One)
SVMs are often employed for binary sentiment detection because they are binary classifiers. In order to perform multi-class classification, the problem needs to be transformed into a set of binary classification problems. There are two approaches to do this: the One vs. Rest Approach (O.V.R) and the One vs. One Approach (O.V.O).

In the system, the O.V.O strategy is used. In the O.V.O strategy, one trains K(K-1)/2 binary classifiers for a K-way multi-class problem; each receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K(K-1)/2 classifiers are applied to an unseen sample, and the class that gets the highest number of "+1" predictions is predicted by the combined classifier. Like O.V.R, O.V.O suffers from ambiguities in that some regions of its input space may receive the same number of votes [11].

b) K-Nearest Neighbour (K-N.N)
The K-NN algorithm is one of the simplest classification algorithms and one of the most used learning algorithms. K-NN is a non-parametric, lazy learning algorithm. Its goal is to use a database in which the data points are separated into several classes to predict the classification of a new sample point [12].

In the case of classification, the output is class membership: the object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours. This rule simply retains the entire training set during learning and assigns to each query the class represented by the majority label of its k nearest neighbours in the training set.

The Nearest Neighbour rule (N.N) is the simplest form of K-NN, when K=1. Given an unknown sample and a training set, all the distances between the unknown sample and the samples in the training set can be computed. The distance with the smallest value corresponds to the sample in the training set closest to the unknown sample. Therefore, the unknown sample may be classified based on the classification of its nearest neighbor. K-N.N is an easy algorithm to understand and implement, and a powerful tool to have at our disposal for sentiment analysis.
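The O.V.O classifier count and the K-N.N majority-vote rule described above can be illustrated with a small stdlib-only sketch; the 2-D points and labels are invented for illustration and stand in for TF-IDF feature vectors.

```python
# Stdlib sketch of two ideas from the classification section:
# the O.V.O pair count K(K-1)/2, and K-NN majority voting
# over Euclidean distances (toy 2-D points, invented labels).
from collections import Counter
from itertools import combinations
import math

# O.V.O: a K-way problem needs one binary classifier per class pair.
classes = ["positive", "negative", "neutral"]
pairs = list(combinations(classes, 2))  # K(K-1)/2 = 3 pairs for K = 3

# K-N.N: label an unknown sample by the majority label of its
# k nearest training samples.
train = [((0.0, 0.0), "negative"), ((0.1, 0.2), "negative"),
         ((1.0, 1.0), "positive"), ((0.9, 1.1), "positive"),
         ((1.0, 0.0), "neutral")]

def knn_predict(x, k=3):
    # Sort the whole training set by distance to x (lazy learning:
    # nothing is precomputed), then vote among the k closest.
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(len(pairs), knn_predict((0.8, 0.9), k=3))
```

With k=1 the same function reduces to the Nearest Neighbour (N.N) rule mentioned above.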
4) PERFORMANCE ANALYSIS
The system calculates its performance in terms of the accuracy, precision, recall and f1-score of the two classifiers.

Accuracy is not the only metric for evaluating the effectiveness of a classifier. The system calculates the accuracy of the S.V.M and K-N.N classifiers as the number of correctly classified positive, negative and neutral words divided by the total number of words present in the corpus.

Precision measures the exactness of a classifier, while recall measures the completeness, or sensitivity, of a classifier. The system calculates them using the sklearn libraries for the Support Vector Machine and K-Nearest Neighbour classifiers. The f1-score is the harmonic average of precision and recall; an f1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

5) EXPERIMENTAL RESULTS
The crawled training dataset consists of 3484 tweets about education, 5455 tweets about business, 2460 tweets about crime and 8078 tweets about health. These crawled data are preprocessed with transformation, negation handling, tokenization, filtering and normalization.

After that, the system selects meaningful features by comparing with the knowledge base and then performs the feature extraction step. The output features are the input features of the Support Vector Machine and K-Nearest Neighbour classifiers. For testing data, tweets about education, business, crime and health are extracted from a particular Twitter account after getting prior permission.

[Figs. 3-6, not reproduced here, are bar charts of the positive, neutral and negative percentages (0-60 scale) for Myanmar, Singapore, Malaysia and Vietnam in each sector.]
Fig. 3. Graphical Analysis about Education
Fig. 4. Graphical Analysis about Business
Fig. 5. Graphical Analysis about Crime
Fig. 6. Graphical Analysis about Health

6) PERFORMANCE COMPARISON
The system compares the accuracy, precision, recall and f1-score of the Support Vector Machine and K-Nearest Neighbour classifiers on the same training dataset.

Table 7. Performance Comparison about Education
Training Data: 2788, Testing Data: 696
Metric     Support Vector Machine  K-NN
Accuracy   0.7446197991391679      0.6958393113342898
Precision  0.7130609476214108      0.7780602580671679
Recall     0.7080479012071043      0.6339499441631286
F1-Score   0.7080479012071043      0.6339499441631286

Table 8. Performance Comparison about Business
Training Data: 4364, Testing Data: 1091
Metric     Support Vector Machine  K-NN
Accuracy   0.7241063244729606      0.7057745187901008
Precision  0.6727147115713413      0.7585090117040704
Recall     0.6740008345575266      0.5960035137416786
F1-Score   0.7239724741913184      0.6819787503355964
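The metrics reported in the comparison tables can be computed with scikit-learn's metrics module, which the performance analysis section says the system uses. The label vectors below are invented for illustration, and macro averaging is an assumption, since the paper does not state its averaging mode.

```python
# Sketch of the metric computation with sklearn.metrics
# (toy true/predicted labels; macro averaging is assumed).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]

acc = accuracy_score(y_true, y_pred)                        # fraction correct
prec = precision_score(y_true, y_pred, average="macro")     # exactness
rec = recall_score(y_true, y_pred, average="macro")         # completeness
f1 = f1_score(y_true, y_pred, average="macro")              # harmonic mean

print(acc, prec, rec, f1)
```

For the real system these vectors would be the held-out test labels and the predictions of each classifier, giving one row group of Tables 7-10 per sector.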
Table 9. Performance Comparison about Crime
Training Data: 1968, Testing Data: 492
Metric     Support Vector Machine  K-NN
Accuracy   0.7215447154471545      0.5873983739837398
Precision  0.694264733332539       0.6301859690017584
Recall     0.6764180216193556      0.5892023932506887
F1-Score   0.7175716931918386      0.5992940183025018

Table 10. Performance Comparison about Health
Training Data: 6463, Testing Data: 1615
Metric     Support Vector Machine  K-NN
Accuracy   0.709389492039024       0.5920848023027024
Precision  0.5842024027422439      0.4703402840230202
Recall     0.5700240203207324      0.4028508204830833
F1-Score   0.6409402930294028      0.5084348028302832

7) CONCLUSION
The Support Vector Machine and K-Nearest Neighbour classifiers are applied to Twitter data to classify tweets about Education, Business, Crime and Health. The system is intended to measure the impact of ASEAN citizens' social media usage behavior. The main purpose of the system is to understand how to perform social media sentiment analytics in a big data environment by applying the machine learning approach of Artificial Intelligence (A.I). The system is developed for analyzing the Business Rate, Crime Rate, Educational Rate and Health Rate occurring in Malaysia, Singapore, Vietnam and our country, Myanmar. The rate of change of these sectors can be compared by looking at these conditions. The system can be expected to contribute many advantages for the Ministries of Education, Commerce, Home Affairs and Health in each country's government.

ACKNOWLEDGEMENT
Firstly, I would like to thank Dr. Aung Win, Rector, University of Technology (Yatanarpon Cyber City), for her vision and for giving valuable advice and guidance in the preparation of this article. I also wish to express my deepest gratitude to my teacher Dr. Hninn Aye Thant, Professor, Department of Information Science and Technology, University of Technology (Yatanarpon Cyber City), for her advice. Last but not least, many thanks are extended to all persons who directly and indirectly contributed towards the success of this paper.

REFERENCES
[1] AaratiPatil, SrinivasaNarasimhaKini: Evolution of Social Media from the era of Information Retrieval, International Journal of Science and Research (IJSR), 4 (14) (2015) 2326-2331.
[2] Bo P., Lillian L., and Shivakumar V.: Sentiment Classification Using Machine Learning Techniques, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (2002).
[3] Go A., Bhayani R., Huang L.: Twitter sentiment classification using distant supervision, CS224N Project Report, Stanford (2009).
[4] Barbosa, L., Feng, J.: Robust sentiment detection on Twitter from biased and noisy data, In: Proceedings of COLING (2010) 36-44.
[5] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, "Sentiment Analysis of Twitter Data", In Proceedings of the ACL 2011 Workshop on Languages in Social Media, 2011, pp. 30-38.
[6] https://en.wikipedia.org/wiki/Data_pre-processing
[7] https://www.researchgate.net/publication/314424838_Negation_Handling_in_Sentiment_Analysis_at_Sentence_Level
[8] https://www.techopedia.com/definition/13698/tokenization
[9] https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
[10] https://en.m.wikipedia.org/wiki/Lemmatisation
[11] https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7
[12] https://sadanand-singh.github.io/posts/svmpython/
