TIISVol11No6-11 1
Copyright ⓒ2017 KSII
Received November 25, 2016; revised February 20, 2017; accepted March 13, 2017;
published June 30, 2017
Abstract
With the rapid growth of web technology and the spread of smart devices, social networking
services (SNS) are widely used. As a result, a huge amount of data is generated from SNS such as
Twitter, and sentiment analysis of SNS data is very important for various applications and
services. In existing sentiment analysis based on the Naïve Bayes algorithm, the same
number of attributes is usually employed to estimate the weight of each class. Moreover,
innumerable attributes, many of them meaningless, are included. This decreases the accuracy of
sentiment analysis. In this paper two methods are proposed to resolve these issues. The first
reflects the difference between the number of positive words and the number of negative words in
calculating the weights, and the second eliminates insignificant words in the feature selection
step using the Multinomial Naïve Bayes (MNB) algorithm. Performance comparison demonstrates
that the proposed scheme significantly increases the accuracy compared to the existing
Multivariate Bernoulli Naïve Bayes (BNB) algorithm and MNB scheme.
Keywords: Twitter sentiment analysis, Machine learning, Naive Bayes, Attribute weighting,
Feature selection
This research was supported by the Institute for Information & communications Technology Promotion (IITP) grant
funded by the Korea government (MSIP) (No. B0717-17-0070), the Basic Science Research Program through the
National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology
(2016R1A6A3A11931385), and the second Brain Korea 21 PLUS project. Corresponding author: Hee Yong Youn.
1. Introduction
Nowadays, social networking services (SNS) are widely used all over the world, and the
number of SNS users has been growing dramatically. People use SNS such as Twitter,
Facebook, and LinkedIn to share their thoughts, views, and lives in online communities, and
a huge amount of data is created from SNS in real time. Because SNS reveals the cooperative and
interdependent relationships among the individuals of a group, sentiment analysis of SNS [1,2] is an
important research topic for identifying the majority opinion of people and developing
intelligent user interfaces. It is required for providing optimal services and assessing social issues.
Critical decision-making processes in various fields also need to utilize the results of sentiment
analysis. For example, service providers can grasp how users respond to their
services, and manufacturers can use it for marketing research. Also, optimal services can be
provided to users by recommendation systems that utilize sentiment analysis
techniques [3,4]. Hence, numerous researchers have been attracted to sentiment analysis, which
involves novel techniques including machine learning [19].
Twitter is one of the most popular SNS platforms used to express the opinions, thoughts, and views
of users. A Twitter user can read and post a message of up to 140 characters called a ‘tweet’.
Twitter had more than 313 million monthly active users in 2016, and almost 500
million tweets are posted per day [6]. Because the massive amount of messages generated in real
time is simple and easy to access, Twitter data has typically been adopted as
the data set for sentiment analysis. Using various machine learning algorithms, sentiment
analysis classifies a Twitter message as ‘positive’ or ‘negative’, and sometimes ‘neutral’.
Machine learning is a powerful technique that allows a computer to learn a specific topic and
predict results based on what it has learned, as a human does [7]. Furthermore, machine learning
can find a particular solution with performance exceeding that of the human brain. Among
various machine learning algorithms, the Naïve Bayes algorithm is generally used for
classification problems due to its simplicity and effectiveness [8]. Twitter sentiment analysis
with deep learning has also become an important issue recently [20]. Deep learning, however,
requires relatively large resources in terms of hardware and computation time. This paper targets
sentiment analysis on resource-constrained systems, and thus focuses on a scheme based on
Naïve Bayes.
A number of approaches, including attribute weighting and feature selection, have been proposed
to improve the performance of the Naïve Bayes algorithm [9]. Most attribute
weighting approaches for text classification use the same number of attributes to calculate the
weight of each class. In sentiment analysis of Twitter, however, the number of positive words and
the number of negative words differ, and the difference can influence the weight of each class.
Because the number of words in positive tweets is greater than that in negative tweets in most cases,
the negative words get collateral benefits in attribute weighting. Moreover, the existing feature
selection approaches extract a subset of attributes based on the weight of each attribute to
enhance the performance of Naïve Bayes. However, these approaches are not suitable for
Twitter sentiment analysis, since Twitter data has innumerable attributes and contains various
meaningless words such as typing errors.
2998 Song et al.: A novel classification approach based on Naïve Bayes for Twitter sentiment analysis
To overcome the limitations of the existing schemes, new methods of attribute weighting and
feature selection are proposed for sentiment analysis of Twitter data based on the Naïve Bayes
algorithm. The first method divides the training set into a positive set and a negative set, and
counts the positive words and negative words separately in each set for attribute weighting. The
second method utilizes the difference between the positive weight and the negative weight of a
word for feature selection. Based on the average of the weight differences, the weight of some
words is changed to zero. This allows meaningless words such as typing errors to be effectively
excluded when the test document is classified. The proposed scheme is evaluated and compared with
the existing schemes using 70,000 training documents and 3,000 test documents obtained from
Sentiment140 [10]. The simulation demonstrates that the proposed scheme predicts the class
of the test set with higher accuracy than the existing approaches. It also shows that attribute
weighting has slightly more influence on the accuracy than feature extraction.
The rest of the paper is organized as follows. Section 2 presents the background and related
research, and Section 3 introduces the proposed scheme. The simulation results of the
proposed scheme are presented in Section 4. Finally, Section 5 gives the conclusion and
future research.
2. Related Work
2.1 Twitter Sentiment Analysis
In most sentiment analysis of Twitter, binary classification is performed in which the
target text is classified as positive or negative; some studies add a third, neutral class. In
recent years a number of researchers have studied Twitter sentiment analysis using various
machine learning techniques. The general structure of Twitter sentiment analysis is shown in Fig. 1.
Fig. 1. General structure of Twitter sentiment analysis (components: training set of positive and negative tweets, data preprocessing, model, classifier, positive/negative classification)
KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 11, NO. 6, June 2017 2999
An approach for automatically classifying the sentiment of Twitter messages using distant
supervision was presented in [6]. Its training data consists of Twitter messages with
emoticons used as noisy labels. The authors used Naïve Bayes, Maximum Entropy (MaxEnt), and
Support Vector Machine (SVM) to build models, with a training set of 1,600,000 tweets and a test
set of 359 tweets extracted from Sentiment140. Their feature extractors are unigrams, bigrams, and
unigrams with Part-Of-Speech (POS) tags, and they concluded that POS tags were not useful.
POS-specific prior polarity features and a tree kernel were presented in [11] to remove the
need for tedious feature engineering and to combine many categories of features into one
convenient representation. The authors demonstrated that the tree kernel and the feature-based
approach outperform the unigram baseline using SVM. A feasible solution with good accuracy
and time efficiency was also proposed in [5]. Here a new feature combination scheme was
developed using sentiment lexicons and the extracted tweet unigrams of high information
gain. Multinomial Naïve Bayes (MNB) was used with a training set of 9,000 tweets and a test set
of 1,000 tweets, and MNB was found to be the best choice for tweet sentiment analysis. An approach
introduced in [12] selects a new feature set using information gain, bigrams, and an
object-oriented extraction method. Based on Naïve Bayes and SVM, the accuracy of the classifier
was shown to be improved by the feature set.
where $n$ is the number of training documents, $l$ is the number of classes, $f_{ji}$ is the frequency of
$w_i$ in the $j$th training document, and $c_j$ is the class of the $j$th training document. The binary
function $\delta(c_j, c)$ is defined in Eq. (5).
$$\delta(c_j, c) = \begin{cases} 1, & \text{if } c_j = c \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
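The quantities defined above can be turned into a small counting routine. The following is a minimal Python sketch, assuming Laplace (add-one) smoothed estimates of the class prior and the word likelihood, which is a common choice for multinomial Naïve Bayes; the smoothing constants and the names `delta` and `train_mnb` are illustrative assumptions rather than the paper's exact equations.

```python
import math
from collections import Counter

def delta(cj, c):
    """Binary indicator of Eq. (5): 1 if the document's class cj equals c, else 0."""
    return 1 if cj == c else 0

def train_mnb(docs, labels, classes):
    """Estimate log P(c) and log P(w|c) from tokenized training documents.

    Assumption: Laplace (add-one) smoothing, a common choice for MNB.
    """
    n = len(docs)                      # number of training documents
    l = len(classes)                   # number of classes
    vocab = sorted({w for d in docs for w in d})
    m = len(vocab)                     # number of distinct words

    log_prior, log_cond = {}, {}
    for c in classes:
        # Smoothed class prior: (sum_j delta(c_j, c) + 1) / (n + l)
        n_c = sum(delta(cj, c) for cj in labels)
        log_prior[c] = math.log((n_c + 1) / (n + l))

        # Word frequencies f_ji accumulated over the documents of class c
        freq = Counter()
        for d, cj in zip(docs, labels):
            if delta(cj, c):
                freq.update(d)
        total = sum(freq.values())
        # Smoothed likelihood: (count of w in class c + 1) / (total + m)
        log_cond[c] = {w: math.log((freq[w] + 1) / (total + m)) for w in vocab}
    return log_prior, log_cond
```

The function works in log space so that the later class scores can be formed by summation instead of multiplying many small probabilities.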
The Multivariate Bernoulli Naïve Bayes (BNB) model is another well-known statistical language model
proposed for text classification. In contrast with MNB, BNB assumes that each feature in a
document is an independent binary variable. BNB considers only whether a word occurs in the
document, without referring to its frequency.
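The difference between the two event models shows up in how a document is turned into features. The sketch below is illustrative only; the function names `mnb_features` and `bnb_features` and the sample tokens are assumptions chosen for the example.

```python
from collections import Counter

def mnb_features(tokens):
    """Multinomial view: each word maps to its frequency in the document."""
    return dict(Counter(tokens))

def bnb_features(tokens, vocabulary):
    """Bernoulli view: each vocabulary word maps to 1 if present, else 0."""
    present = set(tokens)
    return {w: int(w in present) for w in vocabulary}

tokens = ["never", "ever", "never", "forget"]
vocabulary = ["cheer", "ever", "forget", "never"]
print(mnb_features(tokens))              # "never" is counted twice
print(bnb_features(tokens, vocabulary))  # "never" is only marked as present
```

Under the multinomial view the repeated word contributes twice to the class score, while under the Bernoulli view it contributes once and the absent word "cheer" is represented explicitly with a 0.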
The methods for enhancing the performance of Naïve Bayes in machine learning are classified
into five categories: structure extension, feature selection, attribute weighting, local
learning, and data expansion [17]. Among these, the attribute weighting method, which uses a
different weight for each attribute, and the feature selection method, which selects a subset
of attributes based on the weights, are widely employed. Naïve Bayes with attribute weighting
can be expressed as:
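As a hedged illustration of how attribute weights enter the decision rule, the sketch below scores each class as log P(c) plus the weighted sum of word log-likelihoods, a form commonly used for attribute-weighted multinomial Naïve Bayes. Whether this matches the exact expression used in this paper is an assumption; the function name `predict_weighted_nb`, the per-word (rather than per-class) weights, and the default weight of 1.0 are all illustrative.

```python
import math

def predict_weighted_nb(tokens, log_prior, log_cond, weights):
    """Return the class maximizing log P(c) + sum_i W_i * f_i * log P(w_i|c).

    Assumptions: each word's weight scales its log-likelihood contribution;
    words missing from `weights` default to 1.0 (plain MNB); words unseen
    in training are skipped.
    """
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for w in tokens:               # repeated tokens account for f_i
            if w in log_cond[c]:
                score += weights.get(w, 1.0) * log_cond[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score
```

Setting a word's weight to zero removes its influence on every class score, which is how a weighting scheme can double as feature selection.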
[Figure labels: Training set (positive tweets, negative tweets); Data preprocessing]
• Non-English tweets
• Numbers
• Hash tags (e.g. #topic), targets (e.g. @username)
• URL and e-mail address
• Special characters including emoticons
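A preprocessing step covering the items listed above could look like the following minimal Python sketch. It assumes that these items are stripped from each tweet before training; the regular expressions, the ASCII-only test for English, and the names `preprocess` and `is_probably_english` are illustrative assumptions, not the authors' implementation.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
TAG_RE = re.compile(r"[#@]\w+")          # hash tags and @username targets
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # numbers, punctuation, emoticons

def is_probably_english(text):
    """Crude heuristic (an assumption): treat ASCII-only tweets as English.

    Note that this also rejects English tweets containing emoji.
    """
    return text.isascii()

def preprocess(tweet):
    """Return a list of lowercase word tokens, or None for a non-English tweet."""
    if not is_probably_english(tweet):
        return None
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)   # must run before TAG_RE strips "@..."
    text = TAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return text.split()
```

A real system would likely replace the ASCII check with a proper language identifier; it is kept simple here so that the sketch depends only on the standard library.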
Begin
1: Divide $D$ into $D_c$ ($c \in \{\text{positive}, \text{negative}\}$)
2: For $w_{i,c}$ ($i = 1, 2, \ldots, m_c$) from $D_c$
     Calculate $IGR(c, w_{i,c})$ using Eq. (16)
3: For $w_{i,c}$ ($i = 1, 2, \ldots, m_c$) from divided $D_c$
     Calculate the weight $WT_{i,c}$ using Eq. (15)
4: For $w_i$ ($i = 1, 2, \ldots, m$) from $D$
     Calculate the weight difference, $WD_i$, using Eq. (21)
5: Calculate the average of the $WD_i$'s using Eq. (22)
6: For $w_{i,c}$ ($i = 1, 2, \ldots, m_c$) from $D_c$
     Modify $WT_{i,c}$ using Eq. (23)
7: For $d$
     (a) Calculate $P(c)$ using Eq. (3)
     (b) Calculate $P(w_i \mid c)$ using Eq. (14)
     (c) Predict $c(d)$ using Eq. (7)
8: Return $c(d)$
End
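Steps 4 to 6 of the listing above can be sketched in Python as follows. This is only an illustration under stated assumptions: the weight difference is taken as the absolute difference between a word's positive and negative weights, a word absent from one class is given weight 0.0 there, and words whose difference falls below the average are zeroed. The exact Eqs. (21) to (23) may differ, and the function name `select_features` is hypothetical.

```python
def select_features(weights_pos, weights_neg):
    """Zero the weights of words whose class-weight difference is below average.

    Assumptions (the exact Eqs. (21)-(23) are not reproduced here):
      * WD_i is the absolute difference of a word's positive and negative weight;
      * a word missing from one class has weight 0.0 in that class;
      * words with WD_i below the average difference are insignificant.
    """
    words = set(weights_pos) | set(weights_neg)
    if not words:
        return {}, {}, 0.0

    # Step 4: weight difference of every word in the whole training set
    wd = {w: abs(weights_pos.get(w, 0.0) - weights_neg.get(w, 0.0)) for w in words}
    # Step 5: average of the weight differences
    avg_wd = sum(wd.values()) / len(wd)

    # Step 6: modify the per-class weights of the insignificant words
    new_pos, new_neg = dict(weights_pos), dict(weights_neg)
    for w in words:
        if wd[w] < avg_wd:
            if w in new_pos:
                new_pos[w] = 0.0
            if w in new_neg:
                new_neg[w] = 0.0
    return new_pos, new_neg, avg_wd
```

A word such as a typing error tends to have similar weights in both classes, so under these assumptions its difference is small and its weight is set to zero.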
The class label of d1 is predicted with MNB using Eq. (2). log(P(pos|d1)) and log(P(neg|d1))
are calculated as:
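$$\begin{aligned}
\log P(\text{pos} \mid d_1) &= \log P(\text{not} \mid \text{pos}) + \log P(\text{do} \mid \text{pos}) \\
&\quad + \log P(\text{forget} \mid \text{pos}) + \log P(\text{ever} \mid \text{pos}) \\
&\quad + \log P(\text{cheer} \mid \text{pos}) + \log P(\text{never} \mid \text{pos}) \\
&\quad + \log P(\text{pos}) \\
&= -5.53164
\end{aligned} \tag{24}$$
$$\begin{aligned}
\log P(\text{neg} \mid d_1) &= \log P(\text{not} \mid \text{neg}) + \log P(\text{do} \mid \text{neg}) \\
&\quad + \log P(\text{forget} \mid \text{neg}) + \log P(\text{ever} \mid \text{neg}) \\
&\quad + \log P(\text{cheer} \mid \text{neg}) + \log P(\text{never} \mid \text{neg}) \\
&\quad + \log P(\text{neg}) \\
&= -5.22405
\end{aligned} \tag{25}$$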
Since log(P(pos|d1)) < log(P(neg|d1)), the class label of d1 is predicted to be negative. Note,
however, that the sentiment of the test document is actually positive, so the prediction is wrong.
We next show how it is predicted using the proposed scheme.
For the same input data, the weight $WT_{i,c}$ of each word $w_{i,c}$ ($i = 1, 2, \ldots, m_c$) in the divided
training set $D_c$ and the weight difference $WD_i$ are computed using Eq. (15) and Eq. (21),
respectively, with the proposed scheme. Table 2 shows the weights and the weight differences of
the words in $D$.
4. Performance Evaluation
In this section the accuracy of Twitter sentiment analysis using the proposed scheme is
examined through computer simulation. It is also compared with that of the existing MNB and BNB
schemes. The accuracy of Twitter sentiment analysis is calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{29}$$
Here TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false
negative documents, respectively. Table 4 shows the general confusion matrix.
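The accuracy of Eq. (29) is straightforward to compute from the four confusion-matrix counts. The short Python sketch below is illustrative; the function name `accuracy` and the counts in the usage line are made up for the example and are not results from this paper.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy in percent, following Eq. (29)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return 100.0 * (tp + tn) / total

# Hypothetical counts for a 1,000-document test set
print(accuracy(tp=420, tn=380, fp=120, fn=80))
```

With these hypothetical counts, 800 of the 1,000 documents are classified correctly, so the function returns 80.0.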
To evaluate the accuracies, subsets of the 1,600,000-tweet data set supplied by Sentiment140 are
used in the computer simulation. The total data set contains 800,000 tweets of the positive class
and 800,000 tweets of the negative class. In this experiment the ratio of positive to negative
classes is 5:5 in both the training and test sets.
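Drawing a subset with a 5:5 class ratio can be sketched as below. This is an illustrative Python example, not the authors' sampling procedure; the function name `balanced_sample`, the fixed seed, and the label strings are assumptions.

```python
import random

def balanced_sample(pos_tweets, neg_tweets, size, seed=0):
    """Draw `size` tweets with an equal number from each class (5:5 ratio)."""
    if size % 2:
        raise ValueError("size must be even for a 5:5 split")
    half = size // 2
    rng = random.Random(seed)          # fixed seed only for reproducibility
    sample = [(t, "positive") for t in rng.sample(pos_tweets, half)]
    sample += [(t, "negative") for t in rng.sample(neg_tweets, half)]
    rng.shuffle(sample)                # avoid a class-ordered sequence
    return sample
```

For instance, `balanced_sample(pos_tweets, neg_tweets, 10000)` would return 5,000 positive and 5,000 negative tweets in random order, provided each input list holds at least 5,000 tweets.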
Table 5 lists the number of words in the training sets. The sizes of the training sets are 10,000,
20,000, 30,000, 40,000, 50,000, and 70,000. After preprocessing, the number of words in the
training sets drops by about 38% to 53%. Also, the decrease in the
number of positive words is about 2% to 4% larger than that of negative words.
Fig. 3 shows the accuracies of Twitter sentiment analysis with the proposed scheme using the
1,000 test set and different training sets. The accuracies of the MNB and BNB schemes are also
compared. Observe from the figure that the accuracy of the proposed scheme is higher than that of
the other schemes. As the size of the training set increases, the accuracy of all the schemes also
increases, as expected.
Fig. 3. The accuracies of Twitter sentiment analysis for 1,000 test set
Fig. 4 and Fig. 5 show the accuracies with the 2,000 test set and the 3,000 test set, respectively.
Notice from the figures that results similar to those with the 1,000 test set were achieved.
Fig. 4. The accuracies of Twitter sentiment analysis for 2,000 test set
Fig. 5. The accuracies of Twitter sentiment analysis for 3,000 test set
As mentioned earlier, attribute weighting and feature selection are important steps in
sentiment analysis, and thus a new approach has been proposed for each of them to improve
the accuracy. Fig. 6 shows their relative effectiveness with the 1,000 test set. Here the bars of
attribute weighting are for sentiment analysis based on the proposed attribute
weighting approach without feature selection. The bars of feature selection are for the
existing attribute weighting approach combined with the proposed feature selection approach. The
data with both of the proposed approaches are shown with white bars. Notice from the figure that
feature selection is always more influential to the accuracy than attribute weighting regardless
of the number of training documents. Also, the proposed attribute weighting approach is
always more effective than the existing approach, as identified by comparing the bars of
‘Feature selection’ and ‘Proposed’.
Fig. 6. The accuracies of the proposed approaches for 1,000 test set
Fig. 7 and Fig. 8 show the accuracies of the attribute weighting approach and the feature
selection approach using the 2,000 test set and the 3,000 test set, respectively. Notice from the
figures that results similar to those with the 1,000 test set were achieved.
Fig. 7. The accuracies of the proposed approaches for 2,000 test set
Fig. 8. The accuracies of the proposed approaches for 3,000 test set
Fig. 9 compares the performance of the schemes for the positive sets and the negative sets
separately, using the 50,000 training set and the 3,000 test set. Observe that the proposed
approaches consistently outperform the existing MNB and BNB schemes. Also notice that the
accuracies for the positive sets are higher than those for the negative sets. This is because a
document is deemed to be positive if the basis for classifying the test document is insufficient.
Table 6 compares the proposed scheme with the existing schemes in various aspects. Here the
maximum achievable accuracies of the previous schemes are quoted from the respective papers. Notice
that their experimental settings are different. Therefore, for fair comparison, the
5. Conclusion
In this paper novel attribute weighting and feature selection approaches for Twitter sentiment
analysis based on Naïve Bayes have been presented. Most attribute weighting and feature
selection approaches based on the Naïve Bayes algorithm employ the same number of attributes
to estimate the weight of each class. As a result, if the number of attributes of each class is
different, the difference can influence the weight of each class. Since the number of words in
positive tweets is greater than that in negative tweets in most cases, the negative words get
collateral benefits in attribute weighting. Moreover, the existing feature selection approaches do
not effectively handle innumerable and meaningless attributes. These problems decrease the
accuracy of Naïve Bayes in Twitter sentiment analysis. Two methods were
proposed to resolve these issues. The first method reflects the difference between the
number of positive words and the number of negative words in calculating the weights, while
the second identifies the words that are significant for predicting the class of a test document.
According to the experiments with actual test documents, the proposed approach consistently
achieves higher accuracy than the existing Naïve Bayes based approaches for Twitter sentiment
analysis.
In future work the proposed scheme will be enhanced by applying N-gram models
such as bigrams and trigrams with more sophisticated attribute weighting and feature selection
approaches. The effectiveness of the proposed approach will also be investigated in
other classification problems such as text classification and traffic classification.
References
[1] Sitaram Asur and Bernardo A. Huberman, “Predicting the Future with Social Media,” in Proc. of
the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology, pp. 492-499, 2010.
[2] Jeffrey Nichols, Jalal Mahmud and Clemens Drews, “Summarizing Sporting Events Using
Twitter,” in Proc. of the 2012 ACM international conference on Intelligent User Interfaces,
pp. 189-198, 2012.
[3] Anurag P. Jain and Vijay D. Katkar, “Sentiments analysis of Twitter data using data mining,” in
Proc. of International Conference on Information Processing, pp. 807-810, 2015.
[4] Vishal A. Kharde and S.S. Sonawane, “Sentiment Analysis of Twitter Data: A Survey of
Techniques,” International Journal of Computer Applications, vol. 139, no. 11, pp. 5-15, April
2016.
[5] Ang Yang, Jun Zhang, Lei Pan and Yang Xiang, “Enhanced Twitter Sentiment Analysis by Using
Feature Selection and Combination,” in Proc. of International Symposium on Security and Privacy
in Social Networks and Big Data, pp. 52-57, 2015.
[6] Alec Go, Richa Bhayani and Lei Huang, “Twitter Sentiment Classification using Distant
Supervision,” CS224N Project Report, Stanford. 1, 2009.
[7] Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing
Survey, vol. 34, no. 1, pp. 1-47, March, 2002.
[8] S. B. Kotsiantis, “Supervised Machine Learning: A Review of Classification Techniques,”
Informatica, vol. 31, no. 3, pp. 249-268, 2007.
[9] Jingnian Chen, Houkuan Huang, Shengfeng Tian and Youli Qu, “Feature selection for text
classification with Naïve Bayes,” Expert Systems with Applications, vol. 36, no. 3, pp. 5432-5435,
April, 2009.
[10] Saif M. Mohammad, Svetlana Kiritchenko and Xiaodan Zhu, “NRC-Canada: Building the
State-of-the-Art in Sentiment Analysis of Tweets,” in Proc. of the seventh international workshop
on Semantic Evaluation Exercises, 2013.
[11] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow and Rebecca Passonneau, “Sentiment
analysis of Twitter data,” in Proc. of the Workshop on Languages in Social Media, pp. 30-38, 2011.
[12] Bac Le and Huy Nguyen, “Twitter Sentiment Analysis Using Machine Learning Techniques,”
Advanced Computational Methods for Knowledge Engineering, pp. 279-289, 2015.
[13] Jia Wu, Shirui Pan, Xingquan Zhu, Zhihua Cai, Peng Zhang and Chengqi Zhang, “Self-adaptive
attribute weighting for Naive Bayes classification,” Expert Systems with Applications, vol. 42, no.
3, pp. 1487-1502, February, 2015.
[14] Nir Friedman, Dan Geiger and Moises Goldszmidt, “Bayesian Network Classifiers,” Machine
Learning, vol. 29, no. 2, pp. 131-163, November, 1997.
[15] Andrew McCallum and Kamal Nigam, “A Comparison of Event Models for Naive Bayes Text
Classification,” in Proc. of AAAI-98 workshop on learning for text categorization, pp. 41-49, 1998.
[16] Lungan Zhang, Liangxiao Jiang, Chaoqun Li and Ganggang Kong, “Two feature weighting
approaches for naive Bayes text classifiers,” Knowledge-Based Systems, vol. 100, no. 15,
pp. 137-144, May, 2016.
[17] Liangxiao Jiang, Harry Zhang and Zhihua Cai, “A Novel Bayes Model: Hidden Naive Bayes,”
IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361-1371, October,
2009.
[18] Liangxiao Jiang, Chaoqun Li, Shasha Wang and Lungan Zhang, “Deep feature weighting for naive
Bayes and its application to text classification,” Engineering Applications of Artificial Intelligence,
vol. 52, pp. 26-39, June, 2016.
[19] Xuemeng Song, Zhao-Yan Ming, Liqiang Nie, Yi-Liang Zhao and Tat-Seng Chua, “Volunteerism
Tendency Prediction via Harvesting Multiple Social Networks,” ACM Transactions on
Information Systems, vol. 34, no. 2, pp. 1-27, April, 2016.
[20] Aliaksei Severyn and Alessandro Moschitti, “Twitter Sentiment Analysis with Deep
Convolutional Neural Networks,” in Proc. of International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 959-962, August, 2015.
Junseok Song received the BS degree in computer science from Seokyeong University,
Seoul, Korea, in 2015. He is currently working toward the MS degree in the College of
Software at Sungkyunkwan University and is a research staff member of the Ubiquitous computing
Technology Research Institute. His current research interests include Internet of Things
Technology, Big data, and Machine learning.
Kyung Tae Kim received the PhD degree from the College of Information and
Communication Engineering, Sungkyunkwan University, Korea, in 2013. He is
currently a research professor at the College of Software, Sungkyunkwan University,
Korea. His current research interests include Ubiquitous computing, Wireless Networks,
and Internet of Things Technology.
Sangyoung Kim received the BS degree in computer engineering from Inje University,
Gyeongsangnam-do, Korea, in 2015. He is currently working toward the MS degree in the
College of Software at Sungkyunkwan University and is a research staff member of the Ubiquitous
computing Technology Research Institute. His current research interests include Internet
of Things Technology, Big data, and Software Defined Networking.
Hee Yong Youn received the BS and MS degree in electrical engineering from
Seoul National University, Seoul, Korea, in 1977 and 1979, respectively, and the PhD
degree in computer engineering from the University of Massachusetts at Amherst, in
1988. He had been Associate Professor of Department of Computer Science and
Engineering, The University of Texas at Arlington until 1999. He is presently Professor
of College of Software, Sungkyunkwan University, Suwon, Korea, and Director of
Ubiquitous computing Technology Research Institute. He has been also Consulting
Professor of Software R&D Center, Device Solutions, Samsung Electronics, Korea. His
research interests include distributed and ubiquitous computing, IoT, and intelligent
system. He has published more than 400 papers in int'l journals and conference
proceedings, and received Outstanding Paper Award from the 1988 IEEE International
Conference on Distributed Computing Systems, 1992 Supercomputing, and 2012 IEEE
Int’l Conf. on Computer, Information and Telecommunication Systems, 2014 The 6th
International Conference on Cyber-Enabled Distributed Computing and Knowledge
Discovery, respectively. Dr. Youn has also been General Chair of IEEE PRDC 2001, Int’l
Conf. on Ubiquitous Computing Systems (UCS) in 2006 and 2009, UbiComp 2008,
CyberC 2010, Program Chair of PDCS 2003 and UCS 2007. Dr. Youn is a senior member
of the IEEE Computer Society.