Volume: 4 Issue: 3
ISSN: 2321-8169
517 - 523
______________________________________________________________________________________
R. P. Bhavsar
B. V. Pawar
School of Computer Sciences
North Maharashtra University
Jalgaon, India
[email protected]
Abstract: Text classification is the process in which a text document is assigned to one or more predefined categories based on its contents. This paper presents our implementation of three popular machine learning algorithms and a comparative evaluation of their performance on English text document categorization. Three well known classifiers, namely Naïve Bayes (NB), Centroid Based (CB) and K-Nearest Neighbor (KNN), were implemented and tested on the same dataset, R-52, drawn from the Reuters-21578 corpus. For performance evaluation the classical metrics precision, recall, and micro and macro F1-measures were used. For statistical comparison of the three classifiers, the Randomized Block Design method with a T-test was applied. The experimental results show that the Centroid Based classifier outperformed the others with a 97% Micro F1 measure. NB and KNN also produced satisfactory performance on the test dataset, with Micro F1 measures of 91% and 89% respectively.
__________________________________________________*****_________________________________________________
I. INTRODUCTION
Many machine learning techniques have been proposed for document classification, such as Naïve Bayes [7], K-Nearest Neighbor [8], Support Vector Machine (SVM) [9], Decision Tree (DT) [10] and Neural Network (NN) [11].
In our study we considered only supervised learning methods: the classifiers are learned from training data and then evaluated on a new test data set. Our study aims to compare three well known classification algorithms, namely NB, KNN and Centroid Based. The performance of all three classifiers on the same data set is evaluated and compared using the same performance evaluation metrics.
The rest of the paper is organized as follows: Section II summarizes the literature survey, while Section III gives the methodology and a theoretical description of the NB, KNN and Centroid Based text classification algorithms used in this paper. Section IV describes the experimental setup and trials, followed by results and discussion in Section V. Section VI gives the conclusion.
II. LITERATURE SURVEY
whereas stemming is the process of removing suffixes and prefixes, i.e. obtaining the root word (stem). The R-52 dataset used in our study has already been stemmed and had its stopwords removed.
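These two preprocessing steps can be illustrated with a toy sketch. The stopword list and suffix rules below are invented for illustration and are not the preprocessing actually applied to R-52; a real system would use a full stopword list and a proper stemmer such as Porter's algorithm.

```python
# Toy preprocessing sketch: stopword removal followed by crude
# suffix stripping as a stand-in for a real stemmer.
STOPWORDS = {"the", "is", "a", "an", "of", "to", "in", "and"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The trader is shipping the copper to markets"))
# → ['trader', 'shipp', 'copper', 'market']
```

Note that a crude stripper over-stems ("shipping" becomes "shipp"), which is why production systems use rule sets like Porter's.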
C. Selection of Classifiers
1) Naïve Bayes: The Naïve Bayes classifier is very fast and easy to implement, so it is widely used in many practical systems. It is a well known statistical method whose performance is relatively good on large datasets, so it is commonly used for text classification [19], [20]. It is a simple probabilistic classifier based on Bayes' theorem. The classifier makes an independence assumption, i.e. the values of the attributes are assumed independent given the class of an instance [21], and this makes it suitable for large datasets.
Let C = {c1, c2, ..., cn} be the set of predefined classes and d = {w1, w2, ..., wm} be a document vector. We have to find the conditional probability P(ci|d), i.e. the probability that document d belongs to category ci. By Bayes' theorem, P(ci|d) = P(ci) P(d|ci) / P(d), and under the independence assumption P(d|ci) is the product of the word probabilities P(w1|ci) ... P(wm|ci). The document d is assigned to the category ci that has the maximum conditional probability P(ci|d).
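A minimal sketch of this decision rule, assuming a multinomial model with add-one (Laplace) smoothing, a detail the paper does not spell out; the tiny training set is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns class priors,
    per-class word counts, and the vocabulary."""
    prior, counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        prior[label] += 1
        counts[label].update(tokens)
        vocab.update(tokens)
    return prior, counts, vocab

def classify_nb(tokens, prior, counts, vocab):
    """Return argmax_ci P(ci|d), using log-probabilities to avoid underflow."""
    n_docs = sum(prior.values())
    best, best_lp = None, float("-inf")
    for c in prior:
        total = sum(counts[c].values())
        lp = math.log(prior[c] / n_docs)                      # log P(ci)
        for w in tokens:                                      # + sum log P(wj|ci)
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [(["gold", "price", "ounce"], "gold"),
        (["coffee", "export", "crop"], "coffee"),
        (["gold", "mine", "ounce"], "gold")]
prior, counts, vocab = train_nb(docs)
print(classify_nb(["gold", "ounce"], prior, counts, vocab))  # → gold
```

The add-one term keeps unseen words from zeroing out an entire class's probability, which matters on sparse text data.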
2) Centroid Based (CB): In the Centroid Based approach, each category is represented by a centroid (concept) vector computed from the term weights of its training documents. The centroid of each category can then be used to classify a test document. To classify a test document dt, we find the cosine similarity of its document vector with the centroid vector of each category {C1, ..., Cn} and finally assign the document to the class having the highest similarity value; that is, dt is assigned to the class whose centroid maximizes the cosine similarity with dt. The advantage of the CB classification algorithm is that it summarizes the characteristics of each class in the form of a concept vector. Its use has been demonstrated for text classification [24].
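The centroid scheme can be sketched as follows. Raw term counts stand in for the paper's term weights, and the toy training data is invented for illustration:

```python
import math
from collections import Counter

def centroid(docs):
    """Mean of the class's document vectors (here: raw term counts)."""
    c = Counter()
    for d in docs:
        c.update(d)
    return {w: n / len(docs) for w, n in c.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u.get(w, 0.0) * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_cb(tokens, centroids):
    """Assign the document to the class with the most similar centroid."""
    d = Counter(tokens)
    return max(centroids, key=lambda c: cosine(d, centroids[c]))

train = {"gold":   [Counter(["gold", "ounce", "price"]), Counter(["gold", "mine"])],
         "coffee": [Counter(["coffee", "crop", "export"])]}
centroids = {label: centroid(docs) for label, docs in train.items()}
print(classify_cb(["gold", "price"], centroids))  # → gold
```

Classification cost is one similarity computation per class rather than per training document, which is why CB is much faster at test time than KNN.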
IV. EXPERIMENTAL SETUP
IJRITCC | March 2016, Available @ http://www.ijritcc.org
TABLE I. Category-wise distribution of training and testing documents

Category  | Total documents | Training | Testing
----------|-----------------|----------|--------
Alum      |  50             |  31      |  19
Coffee    | 112             |  90      |  22
Cocoa     |  61             |  46      |  15
Copper    |  44             |  31      |  13
Cpi       |  71             |  54      |  17
Gnp       |  73             |  58      |  15
Gold      |  90             |  70      |  20
Grain     |  51             |  41      |  10
Jobs      |  49             |  37      |  12
Reserves  |  49             |  37      |  12
Total     | 650             | 495      | 155
TABLE II. Precision (P), Recall (R) and F1-measure (%) per category for the three classifiers

          | Naïve Bayes     | Centroid Based  | K-Nearest Neighbor
Category  | P    R    F1    | P    R    F1    | P    R    F1
----------|-----------------|-----------------|-------------------
Alum      | 100   74   85   | 100   89   94   | 100   63   77
Coffee    | 100   87   93   | 100   93   97   | 100   80   89
Cocoa     |  92  100   96   | 100  100  100   |  92  100   96
Copper    | 100   85   92   | 100  100  100   |  92   92   92
Cpi       | 100   82   90   | 100   94   97   | 100   88   94
Gnp       |  68  100   81   |  83  100   91   |  75  100   86
Gold      |  91  100   95   | 100  100  100   | 100   95   97
Grain     | 100   90   95   | 100  100  100   |  82   90   86
Jobs      | 100   92   96   | 100   92   96   |  91   83   87
Reserves  |  80  100   89   |  86  100   92   |  67  100   80
Micro Avg |  91   91   91   |  97   97   97   |  89   89   89
Macro Avg |  93   91   91   |  97   97   97   |  90   89   88
V. RESULT AND DISCUSSION
In our experimental trials we obtained Micro Average Precision of 91%, 97% and 89% for NB, CB and KNN respectively, while Macro Average Precision of 93%, 97% and 90% was obtained for NB, CB and KNN respectively. The precision of each category for CB is higher than for the other two methods, which indicates that the CB method usually achieves higher precision. This is shown graphically in Figure 4.
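The distinction between the two averages can be illustrated with hypothetical per-category decision counts (the paper does not publish its confusion matrices, so the numbers below are invented):

```python
# Hypothetical per-category (TP, FP) counts for one classifier.
counts = {"alum": (19, 0), "gnp": (15, 7), "gold": (19, 2)}

# Micro precision pools all decisions before dividing ...
tp = sum(t for t, f in counts.values())
fp = sum(f for t, f in counts.values())
micro_p = tp / (tp + fp)

# ... while macro precision averages the per-category precisions,
# weighting every category equally regardless of its size.
macro_p = sum(t / (t + f) for t, f in counts.values()) / len(counts)

print(round(micro_p, 2), round(macro_p, 2))  # → 0.85 0.86
```

Micro averaging is dominated by the large categories; macro averaging exposes weak performance on small ones, which is why both are reported.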
Figure 4. Category-wise precision of the three classifiers (graph)

ANOVA for the Randomized Block Design:

Source    | DF | Sum of square | Mean sum of square | F-Ratio | Table value
----------|----|---------------|--------------------|---------|------------
Treatment |  2 | 356.6         | 178.3              | 16.59   | 3.55455715
Block     |  9 | 530.7         |  58.97             |  5.49   | 2.45628115
Error     | 18 | 193.4         |  10.74             |         |
Total     | 29 | 1080.7        |                    |         |

Pairwise differences of the classifiers' mean F1-measures (%):

    | NB  | KNN
----|-----|-----
CB  | 5.5 | 8.3
NB  |     | 2.8
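Treating the three classifiers as treatments and the ten categories as blocks, the ANOVA sums of squares can be reproduced from the per-category F1 values (%) in the results table; this sketch is our reconstruction of the computation, not the authors' code:

```python
# Randomized Block Design ANOVA: classifiers = treatments, categories = blocks.
# F1 values (%) per category, taken from the results table.
f1 = {
    "NB":  [85, 93,  96,  92, 90, 81,  95,  95, 96, 89],
    "CB":  [94, 97, 100, 100, 97, 91, 100, 100, 96, 92],
    "KNN": [77, 89,  96,  92, 94, 86,  97,  86, 87, 80],
}
k, b = len(f1), len(next(iter(f1.values())))     # 3 treatments, 10 blocks
all_vals = [x for row in f1.values() for x in row]
grand = sum(all_vals) / (k * b)                  # grand mean

ss_total = sum((x - grand) ** 2 for x in all_vals)
ss_treat = b * sum((sum(row) / b - grand) ** 2 for row in f1.values())
block_means = [sum(f1[t][j] for t in f1) / k for j in range(b)]
ss_block = k * sum((m - grand) ** 2 for m in block_means)
ss_error = ss_total - ss_treat - ss_block

ms_treat = ss_treat / (k - 1)                    # treatment DF = 2
ms_error = ss_error / ((k - 1) * (b - 1))        # error DF = 18
f_ratio = ms_treat / ms_error

print(round(ss_treat, 1), round(ss_block, 1), round(f_ratio, 2))
# → 356.6 530.7 16.59
```

The computed values match the ANOVA table above, and since 16.59 exceeds the tabulated critical value 3.55, the difference between classifiers is statistically significant.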
VI. CONCLUSION
The Centroid Based classifier achieved the best performance, with a 97% Micro Average F1 measure. Of the three classifiers, KNN obtained the lowest performance, an 89% Micro Average F1 measure. Statistical comparison of the classification algorithms shows that the Centroid Based algorithm significantly outperforms NB and KNN. Though the performance of CB is the best, we observed that NB has the fastest classification speed among the three classifiers. We also observed that KNN, being a lazy learning classifier, is the slowest of all.
ACKNOWLEDGMENT
We hereby acknowledge the financial and administrative support extended under the SAP (DRS-I) scheme, UGC New Delhi, at the School of Computer Sciences, NMU, Jalgaon.
REFERENCES
[1] Vishal Gupta and Gurpreet S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1, pp. 60-76, August 2009, doi:10.4304/jetwi.1.1.60-76.
[2] Izzat Alsmadi, Ikdam Alhami, "Clustering and classification of email contents", Journal of King Saud University - Computer and Information Sciences, Volume 27, Issue 1, pp. 46-57, January 2015, doi:10.1016/j.jksuci.2014.03.014.
[3] Taeho C. Jo, Jerry H. Seo, Hyeon Kim, "Topic Spotting on News Articles with Topic Repository by Controlled Indexing", in Intelligent Data Engineering and Automated Learning - IDEAL 2000: Data Mining, Financial Engineering, and Intelligent Agents, Lecture Notes in Computer Science, Volume 1983, pp. 386-391, 27 May 2002.
[4] Hidayet Takçı, Tunga Güngör, "A high performance Centroid-Based classification approach for language identification", Pattern Recognition Letters, Volume 33, Issue 16, 1 December 2012, pp. 2077-2084, doi:10.1016/j.patrec.2012.06.01.
[5] Pratiksha Y. Pawar and S. H. Gawande, "A Comparative Study on Different Types of Approaches to Text Categorization", International Journal of Machine Learning and Computing, Vol. 2, No. 4, August 2012.
[6] Mohammed Abdul Wajeed, T. Adilakshmi, "Text Classification using Machine Learning", Journal of Theoretical and Applied Information Technology, ISSN: 2229-7367, 3(1-2), pp. 233-237, 2012.
[7] Rish I., "An Empirical Study of the Naïve Bayes Classifier", www.cc.gatech.edu/~isbell/reading/papers/Rish.pdf.
[8] Taeho Jo, "Inverted Index based modified version of KNN for text categorization", Journal of Information Processing Systems, Vol. 4, No. 1, 2008.
[9] Gharib T. F., Habib M. B., Fayed Z. T., "Arabic Text Classification Using Support Vector Machines", 2009, wwwhome.cs.utwente.nl/~badiehm/PDF/ISCA2009.pdf.
[10] Su J. and Zhang H., "A Fast Decision Tree Learning Algorithm", Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06), Vol. 1, pp. 500-505, 2006, ISBN: 978-1-57735-281-5.
[22] Dr. Riyad Al-Shalabi, Dr. Ghassan Kanaan, Manaf H. Gharaibeh, "Arabic Text Categorization Using KNN Algorithm", 2006, www.uop.edu.jo/download/research/members/CSIT2006/.../pg20.pdf.
[23] K. Raghuveer and Kavi Narayana Murthy, "Text Categorization in Indian Languages using Machine Learning Approaches", IICAI 2007, pp. 1864-1883.
[24] Eui-Hong (Sam) Han and George Karypis, "Centroid-Based Document Classification: Analysis & Experimental Results", Principles of Data Mining and Knowledge Discovery, pp. 424-431, 2000.
[25] www.cs.umb.edu/~smimarog/textmining/datasets/SomeTextDatasets.html
[26] http://www.r-tutor.com/elementary-statistics/analysis-variance/randomized-block-design