Ijsrp p8252
Author: Naw Naw, DigTech ASEAN
All content following this page was uploaded by Naw Naw on 30 June 2020.
DOI: 10.29322/IJSRP.8.10.2018.p8252
http://dx.doi.org/10.29322/IJSRP.8.10.2018.p8252

Abstract— Social media has become an important and popular communication medium among online surfers. Twitter is one of the most popular social networking services, where millions of users share thoughts and opinions about various aspects and activities. Social media websites are therefore rich sources of data for opinion mining, and such data can be applied to sentiment analysis. Sentiment analysis is the study of human behavior by extracting user opinion and emotion from plain text. Among machine learning techniques, the Support Vector Machine (S.V.M) and K-Nearest Neighbour (K-N.N) classifiers are used in this system. The system provides analytical results on education, business, crime and health for the needs of educational authorities, economists, government organizations and the health sector. The system then predicts the conditions of selected ASEAN countries (Malaysia, Singapore, Vietnam and Myanmar) according to the tweets. Accuracy, precision, recall and f1-score are also compared for the two classifiers.

Index Terms—K-NN, Opinion Mining, Sentiment Analysis, SVM, Twitter.

1) INTRODUCTION
Social media now has a very large effect on digital development in terms of global communications. The emergence of social media has provided a place for web users to share their thoughts and express their opinions on different topics and events. Micro blogging has become a very popular communication tool among Internet users. Millions of messages appear daily on popular web sites that provide micro blogging services, such as Facebook, Twitter and Tumblr. Twitter is an online social networking medium, popular since 2006, where registered users share or post messages of up to 140 characters, known as tweets.
Sentiment analysis, or opinion mining, is the computational study of opinions, attitudes and emotions towards an entity. The entity may be an individual, an event or a topic. The topic is likely to be a review. The system is developed to analyze the Educational Rate, Business Rate, Health Rate and Crime Rate in Malaysia, Singapore, Vietnam and our country, Myanmar, according to the tweets. As a first step, the system crawls real-time data from Twitter as its input. As a second step, sentiment analysis is performed on the crawled data. The system needs to build classifier models (Support Vector Machine and K-Nearest Neighbour) to implement sentiment analysis. Finally, the system outputs the percentage score of these sectors and displays them according to their scores using visualization techniques. The performance of classification is also analyzed using accuracy, precision, recall and f1-score.

2) RELATED WORK
Aarati Patil and Srinivasa Narasimha Kini [1] described the evolution of social media from the era of information retrieval. Their work offers the insight that data scientists and social media users are showing interest in sentiment analysis of social media data. Different approaches have been implemented to automatically detect sentiment in texts [2]. Sentiment analysis on selected micro blogging sites is explored in [3].
Barbosa and Feng [4] presented robust sentiment detection on Twitter from biased and noisy data. They classified the subjectivity of social media messages using traditional features together with site-specific clues such as retweets, hashtags, links, uppercase words, emoticons, and exclamation and question marks. Further, Agarwal, Xie, Vovsha, Rambow and Passonneau introduced Part-Of-Speech (P.O.S) specific prior polarity features and a tree kernel to obviate the need for tedious feature engineering.
Agarwal et al. [5] approached the task of mining sentiment from Twitter as a 3-way task of classifying sentiment into positive, negative and neutral classes. They experimented with three types of models: a unigram model, a feature based model and a tree kernel based model. For the tree kernel based model they designed a new tree representation for tweets. The feature based model, which uses only 100 features, achieves accuracy similar to the unigram model, which uses over 10,000 features. Features that combine the prior polarity of words with their parts-of-speech tags are the most important for the classification task. The tree kernel based model outperformed the other two.

3) SYSTEM DESIGN OVERVIEW
The system design is mainly composed of two parts: Training and Testing. In the training phase, the system trains the classifier model on the input raw dataset of tweets. In the testing phase, the system crawls real tweets about Education, Business, Crime and Health and then classifies each as positive, negative or neutral using the Classifier Model. The system design has four main components: preprocessing, feature selection, feature extraction and classification. The overall system design is illustrated in Fig. 1.
First, English-language tweets about Education, Business, Crime and Health are extracted from Twitter using the Twitter Streaming API. The system then performs the preprocessing step, which consists of transformation, negation handling, tokenization, filtering and normalization. Next, features are selected by comparing them with the Knowledge Base. After that, meaningful features are extracted using Term Frequency-Inverse Document Frequency (T.F-I.D.F), and the resulting features are used as the input features of classification. Finally, the system builds the Classifier Model
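The preprocessing stages named in the overview (transformation, negation handling, tokenization and filtering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper's implementation: the function names, the tiny stop-word list, the negation word list, and the rule that negation scope ends at the next punctuation mark. Normalization (lemmatization) is left out because it needs a dictionary resource.

```python
import re

# Hypothetical, deliberately tiny word lists; a real system would use fuller ones.
STOP_WORDS = {"the", "a", "an", "in", "to", "is", "it", "of", "and", "for", "this"}
NEGATIONS = {"not", "no", "never"}
PUNCTUATION = set(".,!?;")


def transform(tweet):
    """Lowercase the tweet and strip URLs plus the '@' and '#' markers."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[@#]", "", text)


def tokenize(text):
    """Split the text into word tokens and single punctuation marks."""
    return re.findall(r"[a-z0-9_']+|[.,!?;]", text)


def handle_negation(tokens):
    """Prefix words that follow a negation word with 'not_' until punctuation."""
    out, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # assumed rule: punctuation ends the scope
        elif tok in NEGATIONS:
            negated = True
            out.append("not")
        elif negated:
            out.append("not_" + tok)
        else:
            out.append(tok)
    return out


def filter_stop_words(tokens):
    """Drop stop words; negated tokens such as 'not_do' are kept."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(tweet):
    """Run transformation, tokenization, negation handling and filtering."""
    return filter_stop_words(handle_negation(tokenize(transform(tweet))))


if __name__ == "__main__":
    print(preprocess("This school is not worth going. Trust the govt school!"))
```

In this sketch tokenization runs before negation handling purely for convenience, so that the negation rule can work on tokens; the paper lists negation handling first, and its actual ordering and rules may differ.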
International Journal of Scientific and Research Publications, Volume 8, Issue 10, October 2018
ISSN 2250-3153
A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query [9]. The resultant features from the Filtering step are shown in Table 4.

Table 4. Features after Performing the Filtering Step
Step            Features
Filtering       want, rag, picker, children, get, school, trust, bhowapur, govt, school, not, not_worth, not_going, not_plz, not_do, not_something

e) Normalization
In the Normalization step, lemmatization is performed. Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form [10].
After the normalization step is performed, the root words are obtained as shown in Table 5, and they are used for the feature extraction step.

Table 5. Features after Performing the Normalization Step
Step            Features
Normalization   hrdministry, pmoindia, want, rag, picker, children, get, school, trust, bhowapur, govt, school, not, not_worth, not_go, not_plz, not_do, not_something

2) Feature Selection
After the preprocessing step, features are selected by comparing them with the Knowledge Base to improve accuracy. Words related to hundreds of education, business and crime features are collected and added to the Knowledge Base. The system then selects only the features that appear in the Knowledge Base. In this way, the system obtains the essential features, as shown in Table 6, and achieves its best accuracy.

Table 6. Features after Performing the Feature Selection Step
Step               Features
Feature Selection  want, school, trust, not, not_worth, not_go

3) Feature Extraction
The system trains the Support Vector Machine and K-NN classifiers on T.F-I.D.F (Term Frequency-Inverse Document Frequency) weighted word frequencies. TF is how frequently a word occurs in a document. IDF decreases the weight of commonly used words and increases the weight of words that are rarely used in a collection of documents. IDF is combined with term frequency to calculate a term's tf-idf, the frequency of a term adjusted for how rarely it is used. After feature extraction with T.F-I.D.F, the system uses the weighted features as the input features of classification.

4) Classification
Text classification models are used to categorize text into organized groups. Text is analyzed by a model and then the appropriate tags are applied based on the content. Machine learning models that can automatically apply tags for classification are known as classifiers.
Classifiers cannot work automatically without preparation; they need to be trained before they can make specific predictions for texts. Once the classifier has been trained on enough labeled texts, it can learn from those associations and begin to make predictions for new texts.
There are two main approaches to sentiment classification: lexicon-based and machine-learning. A lexicon-based approach tokenizes data into individual words, which are checked against a sentiment lexicon containing a polarity value for each word. The sum of the polarities is passed to an algorithm that determines the overall polarity of the sentence. A machine-learning approach uses a labeled training set to adapt a classifier to the data domain of the training set. The trained classifier can then predict the result of the problem, and the success rate of the prediction depends on how well the problem is contained within the same domain.
There are three types of machine learning algorithms: supervised learning, unsupervised learning and reinforcement learning. This system uses the supervised learning approach. A supervised algorithm has a target or outcome variable which is to be predicted from a given set of predictors. Using this set of variables, a function that maps inputs to desired outputs is generated. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of supervised learning are decision tree, K-Nearest Neighbour (K-N.N), Support Vector Machine (S.V.M), Naïve Bayes (N.B) and Maximum Entropy (MaxEnt). In this system, the S.V.M and K-N.N classifiers are used.

a) Support Vector Machine (One-Versus-One)
SVMs are often employed for binary sentiment detection because they are binary classifiers. To perform multi-class classification, the problem needs to be transformed into a set of binary classification problems. There are two approaches to do this: the One vs. Rest approach (O.V.R) and the One vs. One approach (O.V.O).
In this system, the O.V.O strategy is used. In the O.V.O strategy, one trains K(K-1)/2 binary classifiers for a K-way multi-class problem; each receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K(K-1)/2 classifiers are applied to an unseen sample, and the class that receives the highest number of "+1" predictions is predicted by the combined classifier. Like O.V.R, O.V.O suffers from ambiguities in that some regions of its input space may receive the same number of votes [11].

b) K-Nearest Neighbour (K-N.N)
The K-NN algorithm is one of the simplest classification algorithms and one of the most used learning algorithms. K-NN is a non-parametric, lazy learning algorithm. Its goal is to use a database in which the data points are separated into several classes to predict the classification of a new sample point [12].
In the case of classification, the output is a class membership (the most prevalent cluster may be returned). The object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours. This rule simply retains the entire training set during learning and assigns to each query the class represented by the majority label of its k nearest neighbours in the training set.
The Nearest Neighbour rule (N.N) is the simplest form of K-NN, with K=1. Given an unknown sample and a training set, the distances between the unknown sample and all the samples in the training set can be computed. The distance with the smallest value corresponds to the sample in the training set closest to the unknown sample. Therefore, the unknown sample may be classified based on the classification of the nearest neighbour. The K-N.N is an easy
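To make the T.F-I.D.F weighting and the K-NN majority vote concrete, the following is a minimal standard-library Python sketch, not the system's actual implementation. Several details are assumptions that the paper does not specify: the smoothed form idf = ln(N/df) + 1, cosine similarity as the closeness measure (highest similarity standing in for smallest distance), the function names, and the toy training tweets and labels, which are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf form: ln(N / df) + 1, so terms in every document keep weight > 0.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def vectorize(tokens, idf):
    """Weight a new document with the idf table learned from the training set."""
    counts = Counter(t for t in tokens if t in idf)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def knn_predict(query, train_vectors, labels, k=3):
    """Assign the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set of already preprocessed tweets.
TRAIN_DOCS = [
    ["school", "trust", "good"],
    ["school", "good", "teacher"],
    ["school", "not", "not_worth"],
    ["not", "not_go", "school"],
    ["school", "open", "today"],
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative", "neutral"]

TRAIN_VECTORS, IDF = tfidf_vectors(TRAIN_DOCS)

if __name__ == "__main__":
    query = vectorize(["not", "not_worth", "school"], IDF)
    print(knn_predict(query, TRAIN_VECTORS, TRAIN_LABELS, k=3))
```

Because "school" occurs in every toy document, its idf is the lowest in the table, which mirrors the statement that IDF decreases the weight of commonly used words. With k=1 the function reduces to the Nearest Neighbour rule described in the text.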
Fig. 4. Graphical Analysis about Business [chart omitted; axis labels: Myanmar, Malaysia, Singapore, Vietnam]
Table 9. Performance Comparison about Crime
Training Data: 1968    Testing Data: 492
Metric      Support Vector Machine    K-NN
Accuracy    0.7215447154471545        0.5873983739837398
Precision   0.694264733332539         0.6301859690017584
Recall      0.6764180216193556        0.5892023932506887
F1-Score    0.7175716931918386        0.5992940183025018

Table 10. Performance Comparison about Health
Training Data: 6463    Testing Data: 1615
Metric      Support Vector Machine    K-NN
Accuracy    0.709389492039024         0.5920848023027024
Precision   0.5842024027422439        0.4703402840230202
Recall      0.5700240203207324        0.4028508204830833
F1-Score    0.6409402930294028        0.5084348028302832

7) CONCLUSION
Support Vector Machine and K-Nearest Neighbour classifiers are applied to Twitter data to classify tweets about Education, Business, Crime and Health. The system is intended to measure the impact of ASEAN citizens' social media usage behavior. The main purpose of the system is to understand how to perform social media sentiment analytics in a big data environment by applying the machine learning approach of Artificial Intelligence (A.I). The system is developed for analyzing the Business Rate, Crime Rate, Educational Rate and Health Rate in Malaysia, Singapore, Vietnam and our country, Myanmar. The rate of change of these sectors can be compared by looking at these conditions. The system is expected to benefit the Ministries of Education, Commerce, Home Affairs and Health in each country's government.

ACKNOWLEDGEMENT
Firstly, I would like to thank Dr. Aung Win, Rector, University of Technology (Yatanarpon Cyber City), for her vision and for her valuable advice and guidance in the preparation of this article. I also wish to express my deepest gratitude to my teacher Dr. Hninn Aye Thant, Professor, Department of Information Science and Technology, University of Technology (Yatanarpon Cyber City), for her advice. Last but not least, many thanks are extended to all persons who directly and indirectly contributed towards the success of this paper.

REFERENCES
[1] Aarati Patil, Srinivasa Narasimha Kini: Evolution of Social Media from the era of Information Retrieval, International Journal Science and Research (IJSR), 4 (14) (2015) 2326-2331.
[2] Bo P., Lillian L., and Shivakumar V.: Sentiment Classification Using Machine Learning Techniques, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (2002).
[3] Bhayani A., Huang R., L: Twitter sentiment classification using distant supervision, CS224N Project Report, Stanford (2009).
[4] Barbosa, L., Feng, J: Robust sentiment detection on Twitter from biased and noisy data, In: Proceedings of COLING, (2010) 36-44.
[5] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, "Sentiment Analysis of Twitter Data", In Proceedings of the ACL 2011 Workshop on Languages in Social Media, 2011, pp. 30-38.
[6] https://en.wikipedia.org/wiki/Data_pre-processing
[7] https://www.researchgate.net/publication/314424838_Negation_Handling_in_Sentiment_Analysis_at_Sentence_Level
[8] https://www.techopedia.com/definition/13698/tokenization
[9] https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
[10] https://en.m.wikipedia.org/wiki/Lemmatisation
[11] https://[email protected]/a-quick-indroduction-to-k-nearest-neighbors-algorithm-62214cea29c7
[12] https://sadanand-singh.github.io/posts/svmpython/