Sentiment Analysis Using Weka
III. EXPERIMENT
Machine learning is about learning from the structure of
data. The two main categories of machine learning algorithms
are supervised and unsupervised. In this paper, two popular
supervised machine learning algorithms, namely Decision Tree
(DT) and Support Vector Machines (SVM), were used for
sentiment analysis. A general supervised machine learning
approach for sentiment analysis is shown in figure 1.
A supervised machine learning algorithm is provided with
labelled data as a training set. The algorithm learns from
this data and outputs a trained model. The effectiveness of
this model is then evaluated on unseen data, i.e., data whose
labels are withheld from the model while it predicts.
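The train-then-evaluate flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the "learner" is a trivial majority-class rule rather than DT or SVM, and the example messages and the names train and evaluate are hypothetical, not taken from the paper or from WEKA.

```python
from collections import Counter

def train(examples):
    """Learn from labelled (text, label) pairs and return a trained model.

    The model here is deliberately trivial: it always predicts the most
    common training label. A real learner (DT, SVM) would use the text.
    """
    majority = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda text: majority

def evaluate(model, examples):
    """Accuracy on held-out pairs; labels are used only for scoring."""
    hits = sum(1 for text, label in examples if model(text) == label)
    return hits / len(examples)

# Hypothetical toy data, not the SMS data set used in the paper.
training_set = [("see you at lunch", "positive"),
                ("call me later", "positive"),
                ("win a free prize now", "negative"),
                ("ok thanks", "positive")]
unseen_set = [("running late", "positive"),
              ("free cash claim now", "negative"),
              ("good night", "positive")]

model = train(training_set)
accuracy = evaluate(model, unseen_set)
print(round(accuracy, 3))
```

The same two steps, fitting on labelled data and scoring on data the model has not seen, apply unchanged when the trivial rule is replaced by a real classifier.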
ISSN: 2231-5381
http://www.ijettjournal.org
Page 181
International Journal of Engineering Trends and Technology (IJETT) Volume 18 Number 4 Dec 2014
IV. DATASETS
The training and testing data used for this work were
collected from [4]. It is the SMS Spam Collection data set,
which consists of 5574 SMS messages in positive and negative
categories. Only the first 200 samples were chosen. The
training and test data consist of 167 positive and 33
negative samples (the class totals given in Tables I and II).
Three-fold cross-validation was used to evaluate the
performance of the classifiers.
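As a minimal sketch of how three-fold cross-validation partitions the 200 samples, the code below builds three folds and uses each once as the test set. The class counts (167 positive, 33 negative) are taken from the totals in Tables I and II. The round-robin, class-by-class assignment is an assumption made for simplicity; it is not claimed to reproduce the exact folds WEKA generates.

```python
def stratified_folds(labels, k=3):
    """Assign sample indices to k folds, class by class, in round-robin order."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def cross_validation_splits(labels, k=3):
    """Yield (train_indices, test_indices), holding out each fold once."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Class counts from the Total rows of Tables I and II.
labels = ["positive"] * 167 + ["negative"] * 33
splits = list(cross_validation_splits(labels, k=3))
for train, test in splits:
    print(len(train), len(test))
```

Every sample is tested exactly once and trained on twice, so the reported accuracy is computed over all 200 samples.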
V. RESULTS
WEKA [5] is an open source tool that provides a collection
of machine learning algorithms. In the WEKA tool, the data
set is loaded first. Under the meta classifiers, the
FilteredClassifier was used. The filter used was
StringToWordVector. This filter breaks each sentence into
individual words. A stemmer was used to reduce words such as
Driving, Drives and Drive to a single word, Drive. Stemming
reduces the number of features and the sparsity of the data.
A stop list was used to exclude common words such as I, is,
the, that, etc. A term frequency count was used to count the
number of occurrences of each word in a given sentence.
The parameters used in WEKA for the filter and the decision
tree are highlighted and shown in figure 3. The decision tree
algorithm used in WEKA was J48. Under the StringToWordVector
filter, IDFTransform, TFTransform, outputWordCount and
useStopList were set to True. The stemmer used was
IteratedLovinsStemmer.
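The preprocessing steps described above (word splitting, stop list, stemming, term counts and TF/IDF weighting) can be sketched as follows. This is an illustration under assumptions, not the WEKA implementation: toy_stem is a hypothetical suffix stripper standing in for IteratedLovinsStemmer, the stop list contains only the four words named in the text, the sample sentences are invented, and the weighting assumes the transforms are log(1 + f) for TFTransform and multiplication by log(N / n_t) for IDFTransform, which is how the WEKA documentation describes them.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"i", "is", "the", "that"}  # only the examples named in the text

def toy_stem(word):
    """Toy suffix stripper (NOT the Lovins stemmer): removes one common ending."""
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_counts(sentence):
    """Lower-case, split into words, drop stop words, stem, count occurrences."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(toy_stem(t) for t in tokens if t not in STOP_WORDS)

def tf_idf(documents):
    """Weight each count f as log(1 + f) * log(N / n_t) (assumed definitions)."""
    counts = [word_counts(d) for d in documents]
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts for term in c)
    return [{t: math.log(1 + f) * math.log(n_docs / doc_freq[t])
             for t, f in c.items()} for c in counts]

# Hypothetical sentences, not taken from the SMS data set.
docs = ["He is driving the car", "She drives that car", "I drive"]
weights = tf_idf(docs)
print(word_counts("Driving Drives Drive"))
print(weights[0])
```

In this toy corpus the three forms of "drive" collapse into one feature, and because that feature occurs in every document its IDF factor is log(3/3) = 0, so it carries no weight.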
Various tokenizers (which break text into words or features),
such as WordTokenizer, AlphabeticTokenizer and
NGramTokenizer, were applied for both DT and SVM. For
NGramTokenizer, the minimum n-gram size was 1 and the
maximum n-gram size was 3. Time taken and accuracy results
are shown in Table III. The results show that SVM takes less
time to build the model than DT with every tokenizer. SVM was
also more accurate than DT with the Word and Alphabetic
tokenizers (91% vs. 87.5% and 92% vs. 89%), although with the
NGram tokenizer DT was marginally ahead (88% vs. 87%).
TABLE III
TIME FOR BUILDING MODEL AND ACCURACY RESULTS AGAINST
VARIOUS TOKENIZERS

Tokenizer    DT         DT time to    SVM        SVM time to   No. of tokens or
             accuracy   build model   accuracy   build model   features extracted
Word         87.5%      1.20 sec      91%        0.08 sec      873
Alphabetic   89%        1.06 sec      92%        0.05 sec      769
NGram        88%        11.78 sec     87%        0.25 sec      6524
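To make the difference between the three tokenizers concrete, the sketch below approximates each one in plain Python. These functions are assumptions written for illustration, not the WEKA classes: the delimiter set is an assumed default for WordTokenizer, the alphabetic version simply keeps runs of ASCII letters, and the n-gram version joins 1 to 3 consecutive word tokens. The sample message is hypothetical.

```python
import re

# Assumed delimiter set: whitespace plus common punctuation.
DELIMITERS = " \r\n\t.,;:'\"()?!"
_SPLIT = re.compile("[" + re.escape(DELIMITERS) + "]+")

def word_tokens(text):
    """Split on the delimiter set; digits and symbols stay inside tokens."""
    return [t for t in _SPLIT.split(text) if t]

def alphabetic_tokens(text):
    """Keep only maximal runs of letters, so numbers never become tokens."""
    return re.findall(r"[A-Za-z]+", text)

def ngram_tokens(text, n_min=1, n_max=3):
    """All runs of n_min..n_max consecutive word tokens, joined by a space."""
    words = word_tokens(text)
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

message = "Win 1000 cash now!"  # hypothetical example
print(word_tokens(message))
print(alphabetic_tokens(message))
print(ngram_tokens(message))
```

On this four-word message the alphabetic version yields fewer tokens than the word version and the n-gram version yields more than twice as many, which is in line with the ordering of the feature counts in Table III (769, 873 and 6524).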
VI. CONCLUSIONS
The growth of social networks is giving rise to a vast
amount of online data. Analysis of this data yields
insightful information for business intelligence. Analysing
unstructured social network data is a challenging problem.
In this work, a machine learning approach was applied to text
analysis. Support vector machines, a supervised machine
learning approach, took less time to build the model and,
with the Word and Alphabetic tokenizers, gave more accurate
results on SMS spam text classification than the decision
tree learning approach.
TABLE I
For Decision Trees

                    Predicted Class
Actual Class    Negative   Positive   Total
Negative               9         24      33
Positive               1        166     167
Total                 10        190     200
TABLE II
CROSS VALIDATION RESULTS FOR SVM
For Support Vector Machine

                    Predicted Class
Actual Class    Negative   Positive   Total
Negative              15         18      33
Positive               0        167     167
Total                 15        185     200
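The Word tokenizer accuracies reported in Table III (87.5% for DT and 91% for SVM) can be recomputed from the counts in Tables I and II. The sketch below does this and also derives per-class recall, which the paper does not report; the function name metrics and the choice of extra measures are illustrative assumptions.

```python
def metrics(tn, fp, fn, tp):
    """Summary measures from a 2x2 confusion matrix.

    tn: actual negative, predicted negative
    fp: actual negative, predicted positive
    fn: actual positive, predicted negative
    tp: actual positive, predicted positive
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "negative_recall": tn / (tn + fp),
        "positive_recall": tp / (tp + fn),
    }

# Counts copied from Table I (Decision Trees) and Table II (SVM).
decision_tree = metrics(tn=9, fp=24, fn=1, tp=166)
svm = metrics(tn=15, fp=18, fn=0, tp=167)

# Reference point: always predicting the larger class (167 of 200 samples).
majority_baseline = 167 / 200

for name, result in (("DT", decision_tree), ("SVM", svm)):
    print(name, {k: round(v, 3) for k, v in result.items()})
print("baseline", majority_baseline)
```

Because 167 of the 200 samples are positive, always predicting the positive class would already score 83.5%, so the recall on the negative class (9 of 33 for DT, 15 of 33 for SVM) is a useful complement to the overall accuracy.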
REFERENCES
[1] Liu, Bing. "Sentiment analysis and opinion mining."
    Synthesis Lectures on Human Language Technologies,
    Vol 5, no. 1, 2012, pp. 1-167.
[2] Ahmed, Ishtiaq, Donghai Guan, and Tae Choong Chung.
    "SMS Classification Based on Naive Bayes Classifier
    and Apriori Algorithm Frequent Itemset." International
    Journal of Machine Learning & Computing, Vol 4, no. 2,
    2014.
[3] Rajaram, Ramasamy, and Appavu Balamurugan.
    "Suspicious E-mail detection via decision tree: A data
    mining approach." CIT. Journal of Computing and
    Information Technology, Vol 15, no. 2, 2007, pp. 161-169.
[4] Dataset, SMS Spam Collection, URL
    http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
    [Last Accessed November 2014]
[5] WEKA tool, URL http://www.cs.waikato.ac.nz/ml/weka/
    [Last Accessed November 2014]