(IEEE Semantic 2008 Pingpen Yuan) MSVM-KNN Multi-Class Text Classification
sample space to a transform space so as to find a solution for nonlinear problems.
Compared to other categorization methods, SVM has apparent advantages in avoiding over-fitting, slow computation, and poor precision of the result. But it also has weaknesses, mainly in the following aspects. SVM is essentially a binary classifier, while in practice many problems are multi-class categorization problems, so SVM cannot be used directly for multi-class categorization [7]. Besides, improving SVM categorization speed and reducing the number of training samples remain unsolved problems.

3.2 k-Nearest Neighbor

k-NN is an example-based text categorization algorithm. When k-NN is applied to text categorization, it calculates the cosine of the angle between the test text and each sample text in the training set to obtain their similarities, takes the k most similar training texts according to those similarities, and produces a score for each category based on them. Finally, the scores are ranked, and the test text is assigned to the category with the highest score.
k-NN is suitable for small data sets and can achieve good performance on them. However, there is no good general solution for choosing k, so k-NN usually starts from an initial value and then adjusts k according to experimental results.
The selection of the k most similar texts also has a large effect on the categorization results. In practical applications, samples generally have the following characteristics [18]: samples of the same category show a multi-peak distribution, so the distances between samples of the same category are likely to be larger than those between samples of different categories. When the distribution of the samples is not very compact, many samples lie near the category borders and many misclassifications occur, so the algorithm is inaccurate in many cases. Therefore, the k-NN algorithm cannot effectively solve the problem of overlapping category borders.

4. The Multi-class SVM-kNN Approach

In this section, we describe our approach, Multi-class SVM-kNN (MSVM-kNN), which is a combination of SVM and k-NN. In general text categorization, the popular approaches to multi-class SVM are 1-vs-1 and 1-vs-r (rest) [19]. Both of them are time consuming and may leave indivisible cases. To reduce the training time, we adopt an incremental learning method in MSVM-kNN; to solve the indivisible problem, k-NN is adopted to classify the documents that remain indivisible after SVM has classified the rest. Before introducing the incremental learning method and the classification of indivisible documents, we describe the term weight assignment of our approach, which is the basis of the classification.

4.1 Information entropy based term weight assignment

In text categorization, the common document representation scheme is the vector space model, in which the structure of a document and the order of words in the document are ignored. The feature vectors represent the words observed in the documents, and every term in a feature vector is assigned a weight.
The weights of the terms are generally assigned using the standard tf*idf, shown as formula (1):

W_tfidf(t, d) = tf(t, d) × log(N / n_t + 0.01)    (1)

Here tf is the term frequency and idf is the inverse document frequency. Formula (1) is widely used in information retrieval and text categorization. However, the contribution of a term to classification is not reflected in formula (1). In fact, the importance of a term depends not only on its term frequency and document frequency, but also on its contribution to classification. For example, suppose terms t1 and t2 have the same term frequency in document d1 and the same document frequency in the training set, but the documents in which t1 occurs all appear in one category, while the documents in which t2 occurs appear uniformly in every category. Obviously, t1 is more important than t2 for categorization and should have a higher weight; however, if we compute the weights with the standard tfidf function, the weight of t1 is equal to that of t2.
Considering the relationship between terms and categories, we measure the distribution of a term's documents over the categories when weighting terms. This distribution can be weighed by the information entropy H, which is defined as:

H(t_i, d_j) = − Σ_{k=1}^{|C|} (df_ik / dn(t_i)) log(df_ik / dn(t_i))    (2)

where df_ik denotes the document frequency of t_i in category c_k, dn(t_i) denotes the number of documents in which t_i occurs, and |C| denotes the number of categories.
The more uniformly the documents containing t_i are distributed over the categories, the higher the value of H; the less uniformly they are distributed, the lower the value of H. H is highest when, and only when, the probability of the term's documents falling in each category is equal.
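The entropy H of formula (2) can be computed directly from the per-category document frequencies of a term. The sketch below is illustrative (the function name is ours, and the natural logarithm is an assumption, since the paper does not fix the base of the log):

```python
import math

def term_entropy(df_per_category):
    """Entropy H of a term's document distribution over categories,
    as in formula (2): df_per_category[k] is df_ik, the number of
    documents of category k that contain the term; their sum is dn(t)."""
    dn = sum(df_per_category)
    h = 0.0
    for df in df_per_category:
        if df > 0:                 # 0 * log(0) is treated as 0
            p = df / dn
            h -= p * math.log(p)
    return h
```

A term spread evenly over four categories gets the maximal entropy log 4 ≈ 1.386, while a term confined to a single category gets 0; since formula (3) divides by H(t, d), a practical implementation would need to smooth the zero-entropy case.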
Therefore, we propose a method for computing term weights, tfidfh:

W_tfidfh(t, d) = [tf(t, d) × log(N / n_t + 0.01)] / H(t, d)    (3)

This function embodies the intuition that the more frequently a term occurs in a document, the more representative it is of the document's content; the more documents a term occurs in, the less discriminating it is; and the more uniformly the documents containing a term are distributed over the training set, the looser its relationship with the categories. In the foregoing example, when the tfidfh function is used to compute the term weights, t1 receives a higher value than t2.
Text categorization is often characterized by the high dimensionality of the associated feature space and a relatively small number of training samples. One way to reduce the number of features in document categorization is to select a subset of the best terms from the entire feature space [12]. In this paper, we use a weight-based best-n feature selection approach, in which terms are sorted by their weights and the top n terms with the highest weights are selected: we compute the information gain for each unique term and choose the top n as the feature terms of the documents.

4.2 Incremental learning for multi-class SVM

Vapnik [1] has proven that the training result of SVM depends only on a small set of samples, the so-called Support Vectors (SVs), and that the normal of the optimal hyperplane HP can be expressed as a combination of the SVs. Usually, the SV set is only a small portion of the training set. Thereby, if we train the SVM on the SV set instead of the whole training set, the training time can be reduced greatly without much loss of categorization precision. Following this idea, we divide the samples of the SV set into two classes: BSVs (Boundary Support Vectors) and NSVs (Normal Support Vectors). BSVs correspond to the vectors of documents that cannot be classified correctly. These two kinds of SVs together determine the results of the classifier.
Experiments also show that these roles do not always remain the same. When successive training samples are introduced, BSVs, NSVs, and normal samples can transform into one another. For example, due to the limitation of the initial training set, samples that carry the main categorization information may be misclassified as BSVs. As knowledge about the categorization accumulates over the successive incremental training, the contribution of these samples to categorization becomes clear, and these BSVs gradually degrade to NSVs or normal samples.
As we can see above, the equivalence between the SV set and the training set is broken when new samples are introduced into the training set during incremental learning. Thus, on one hand, the original SV set cannot fully describe the categorization properties of the new training set; on the other hand, with the introduction of the new samples, some old samples must be discarded optimally to reduce the storage cost. Therefore, two major problems arise in an incremental SVM learning algorithm: (1) how to construct the new SV set from the initial training set; (2) how to discard old samples optimally.
Traditional SVM algorithms discard all previous training results and re-train the classifier on the whole data set in the case of incremental learning. This method is so slow that it greatly hampers the application of SVM learning in the area of incremental learning. Here we describe a new scheme, which uses an iterative method to find the optimal convergent result on the whole training set. The scheme proceeds as follows. First, the old classifier is tested on the new incremental sample set. The samples classified incorrectly are combined with the current SV set to construct a new training set, and the remaining samples form a new test set. Then a new SVM classifier is trained on the new training set, and the new test set is used to repeat the previous operations. The process continues until all data points are classified correctly.

4.3 k-NN based indivisible cases classification

For multi-class categorization using SVM, we must build several binary classifiers to construct the required multi-class model. These binary classifiers divide the sample space into several zones. Some zones can be identified with a single category, but other zones lie between multiple hyperplanes, and SVM-based classifiers cannot effectively identify samples in these zones.
To solve this problem, we introduce k-NN into the classifier. For an indivisible document, we find its k most similar neighbors among the SVs, and then produce a score for each category based on these similarities. The score of a category is the sum of the similarities of the neighbors belonging to that category, and the test document is assigned to the category with the highest score. In this way, we need not calculate the similarities to all the samples, which reduces the time consumption while solving the indivisible case.
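The k-NN fallback for indivisible documents can be sketched as follows. This is a minimal illustration with hypothetical names: documents are sparse term-to-weight dictionaries, and `support_vectors` stands for the SVs retained by the trained SVM, paired with their category labels:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_indivisible(doc, support_vectors, k=5):
    """Assign an indivisible document to the category whose k nearest
    support vectors have the highest summed similarity."""
    nearest = sorted(((cosine(doc, v), c) for v, c in support_vectors),
                     key=lambda sc: sc[0], reverse=True)[:k]
    scores = defaultdict(float)
    for sim, cat in nearest:
        scores[cat] += sim         # per-category sum of similarities
    return max(scores, key=scores.get)
```

Because only the SVs are scanned rather than the whole training set, the similarity computation stays cheap, which is the saving this fallback aims at.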
5. Implementation and Experimental Results

5.1 Implementation

We have implemented an automatic literature categorization system in SemreX [20, 21]. The goal of the automatic literature categorization (ALC) system is to classify a set of documents into a fixed number of predefined categories. The whole categorization process can be divided into three stages: the text processing stage, the training stage, and the testing stage. In the text processing stage, ALC uses the vector space model to represent the documents (both the training set and the testing set); an unlabeled document must be preprocessed and represented by VSM before it is labeled [22]. In the second stage, ALC learns from the training set and constructs a classifier based on MSVM-kNN. Finally, we use the classifier to classify unlabeled documents. Figure 2 shows the overall flow diagram of ALC.

[Figure 2. Flow Diagram of ALC: preprocessing and feature selection produce the training set, the training algorithm builds the classifier, and the classifier assigns categories to the testing set.]

We use the MSVM-kNN approach as our classifier. In the classifier, our improvements are concentrated on the following points. First, the interior samples are gradually discarded at a certain ratio, which reduces the training set and the storage cost of the old samples. Second, some near-boundary samples are optimally introduced into the training set to accelerate the convergence of the final SV set search procedure.
In order to reduce the training time and other resource requirements of the classifier, the classifier of ALC manages three sample sets: the working set, the cache set, and the backup set. As the incremental training proceeds, the knowledge of the sample distribution accumulates gradually, and the contributions of some samples to the classifier change, so these samples can be moved from one set to another. Because of their different contributions to the final classifier, the characteristics of these three kinds of sample sets can be used to optimize the training procedure. Since the samples in the backup set do not affect the result of the final classifier, they can be discarded gradually, and the remaining samples can be used as the test set to validate the newly trained classifier. The samples in the working set are actively involved in the computation of the final classifier, so they can be introduced directly into the training set of the new classifier to accelerate the training procedure. In general, the contribution of the cache set to the final classifier lies between those of the backup set and the working set; its samples affect the previous categorization result, so they can be selected into the training set of the new classifier according to a certain scheme and priority.

5.2 Experimental results and analysis

For our experiments, we use two data sets: the first is the well-known 20 newsgroups data set; the second consists of literature collected from ACM, which is named the ACM data set here. The 20 newsgroups data set consists of 20 different news groups, each containing approximately 1000 documents; in this paper, we select 19,996 of these documents. The ACM data set contains nine categories with 25 documents per category.
In our experiments, we compare the performance of our approach, MSVM-kNN, with that of SVM and k-NN. Before presenting the experimental results, we introduce the performance measures commonly used to evaluate categorization methods: Precision, Recall, and F-measure.
The performance measures are computed as follows. In one experiment, if we define A as the number of positive samples predicted as positive, B as the number of positive samples predicted as negative, C as the number of negative samples predicted as positive, and D as the number of negative samples predicted as negative, then Precision, Recall, and F-measure can be expressed as:

Precision = A / (A + C)    (4)

Recall = A / (A + B)    (5)

F-measure = 2 × Precision × Recall / (Precision + Recall)    (6)

In the experiments, we first use MSVM-kNN, SVM, and k-NN on the 20 newsgroups data set. Table 1 shows the F-measure of the three text categorization approaches on 20 newsgroups. According to the experimental results, MSVM-kNN performs better than k-NN or SVM in 16 of the 20 categories.
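Formulas (4)-(6) translate directly into code; a small sketch using the same A, B, C counts (D is not needed for these three measures):

```python
def precision_recall_f(A, B, C):
    """Precision, Recall and F-measure from formulas (4)-(6).
    A: positives predicted positive, B: positives predicted negative,
    C: negatives predicted positive."""
    precision = A / (A + C)
    recall = A / (A + B)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 8 correctly recognized positives with 2 missed and 2 false alarms give Precision = Recall = F-measure = 0.8.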
In category sci.space, the F-measure of MSVM-kNN reaches 0.99; however, in categories talk.politics.misc, comp.os.ms-windows.misc, and talk.religion.misc, the F-measures of MSVM-kNN are 0.75, 0.75, and 0.77, the lowest values among the 20 newsgroups. But SVM and k-NN also show their worst performance in these three categories; one of the reasons is that these categories contain miscellaneous documents. In the following four categories: rec.motorcycles, rec.sport.hockey, sci.med, and soc.religion.christian, MSVM-kNN does not show better performance than the other two approaches. However, the F-measure of MSVM-kNN is not far behind: in the three categories other than sci.med, the difference in F-measure between MSVM-kNN and SVM is 0.01.

Table 1. F-measure of MSVM-kNN, SVM and k-NN with 20 newsgroups

Categories | k-NN | SVM | MSVM-kNN
alt.atheism | 0.78 | 0.79 | 0.80
comp.graphics | 0.81 | 0.82 | 0.85
comp.os.ms-windows.misc | 0.65 | 0.64 | 0.75
comp.sys.ibm.pc.hardware | 0.76 | 0.79 | 0.83
comp.sys.mac.hardware | 0.91 | 0.90 | 0.92
comp.windows.x | 0.82 | 0.83 | 0.89
misc.forsale | 0.84 | 0.89 | 0.92
rec.autos | 0.91 | 0.96 | 0.98
rec.motorcycles | 0.87 | 0.98 | 0.97
rec.sport.baseball | 0.86 | 0.96 | 0.98
rec.sport.hockey | 0.92 | 0.98 | 0.97
sci.crypt | 0.89 | 0.93 | 0.93
sci.electronics | 0.82 | 0.93 | 0.94
sci.med | 0.86 | 0.96 | 0.93
sci.space | 0.81 | 0.97 | 0.99
soc.religion.christian | 0.83 | 0.94 | 0.93
talk.politics.guns | 0.86 | 0.94 | 0.94
talk.politics.mideast | 0.83 | 0.88 | 0.91
talk.politics.misc | 0.72 | 0.75 | 0.75
talk.religion.misc | 0.64 | 0.32 | 0.77
Average | 0.82 | 0.86 | 0.90

Figure 3 shows the Precision, Recall, and F-measure of MSVM-kNN, SVM, and k-NN on the 20 newsgroups data set. The Precision and Recall of MSVM-kNN are 0.91 and 0.90, respectively, whereas the Precision and Recall of SVM and k-NN are 0.86 and 0.88, and 0.82 and 0.83, respectively. According to these results, MSVM-kNN achieves better Precision and Recall than either SVM or k-NN alone.

[Figure 3. Performances of MSVM-kNN, SVM and k-NN on 20 newsgroups (Precision, Recall and F-measure).]

Secondly, we compare the performance of the three approaches on the ACM data set, which contains 225 documents. The F-measure comparison of the three approaches on the ACM data set is shown in Table 2. The experimental results show that MSVM-kNN performs better than k-NN or SVM in most of the categories except Data.

Table 2. F-measure of MSVM-kNN, SVM and k-NN with the ACM data set

Categories | k-NN | SVM | MSVM-kNN
Hardware | 0.80 | 1.00 | 1.00
Computer Systems Organization | 0.89 | 0.73 | 0.91
Software | 0.89 | 0.89 | 0.89
Data | 0.73 | 0.89 | 0.83
Theory of Computation | 0.89 | 0.91 | 0.91
Mathematics of Computing | 0.56 | 0.67 | 0.75
Information Systems | 0.73 | 0.73 | 0.89
Computing Methodologies | 0.83 | 0.91 | 0.91
Computer Applications | 0.75 | 0.89 | 0.89
Average | 0.79 | 0.85 | 0.89

Figure 4 shows the Precision, Recall, and F-measure of MSVM-kNN, SVM, and k-NN on the ACM data set. The results also show that MSVM-kNN achieves better Precision and Recall than either SVM or k-NN alone: the Precision and Recall of MSVM-kNN are 0.90 and 0.89, whereas those of SVM and k-NN are 0.86 and 0.84, and 0.82 and 0.78, respectively.
According to the experimental results above, MSVM-kNN achieves better performance in most cases on both the 20 newsgroups and the ACM data sets.
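The "Average" rows in Tables 1 and 2 are consistent with an unweighted arithmetic mean of the per-category F-measures; a one-line sketch under that assumption (the paper does not state its averaging scheme explicitly):

```python
def macro_average(values):
    """Unweighted macro-average, e.g. of per-category F-measures."""
    return sum(values) / len(values)
```

Applied to the MSVM-kNN column of Table 2, it reproduces the reported average of 0.89 after rounding.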
We can see that the Precision, Recall, and F-measure on the two data sets are improved to various degrees. The average F-measure on 20 newsgroups is 0.90, while the best average F-measure of the other two methods is 0.86. The average F-measures of MSVM-kNN, SVM, and k-NN on the ACM data set are 0.89, 0.85, and 0.80. Moreover, the best and worst F-measures of MSVM-kNN on the 20 newsgroups data set are 0.99 and 0.75, while those of SVM and k-NN are 0.98 and 0.32, and 0.92 and 0.64, respectively. Therefore, MSVM-kNN is more stable than the other approaches.

[Figure 4. Performances of MSVM-kNN, SVM and k-NN on the ACM data set (Precision, Recall and F-measure).]

6. Conclusion

Although automatic text categorization has become comparatively mature in some aspects and is now used in many fields, no text categorization approach performs well in all fields, and it is difficult to further improve the performance of a single approach. Therefore, it is necessary to combine two or more text categorization approaches to categorize a data set.
In this paper, we give a brief overview of text categorization and propose a text categorization approach named Multi-Class SVM-kNN (MSVM-kNN), which is a combination of SVM and k-NN. In this approach, term weights are assigned according to information entropy; then SVM is used to classify samples, and k-NN is used to deal with the indivisible cases. The experimental results show that MSVM-kNN performs better than SVM and k-NN in most cases.

References

[1] F. Sebastiani, "Machine Learning in Automated Text Categorization", Technical Report, Centre National de la Recherche Scientifique, 1999.
[2] R. Duwairi, "An eager k-nearest-neighbor classifier for Arabic text categorization", Proceedings of the International Conference on Data Mining (ICDM'05), Nevada, USA, 2005, pp. 187-192.
[3] J. Cervantes, X. Li, W. Yu, and K. Li, "Support vector machine classification for large data sets via minimum enclosing ball clustering", Neurocomputing, Vol. 71, No. 4, 2008, pp. 611-619.
[4] Y. S. Xia and J. Wang, "A one-layer recurrent neural network for support vector machine learning", IEEE Trans. Syst. Man Cybern. B, 2004, pp. 1261-1269.
[5] E. Gabrilovich and S. Markovitch, "Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5", Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada, Morgan Kaufmann, 2004, pp. 321-328.
[6] Y. Yang, J. Zhang, and B. Kisiel, "A scalability analysis of classifiers in text categorization", Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, ACM Press, 2003, pp. 96-103.
[7] Y. Y. Chung, E. H. C. Choi, L. Liu, M. A. Shhukran, D. Y. Shi, and F. Chen, "A New Hybrid Audio Categorization Algorithm Based on SVM Weight Factor and Euclidean Distance", Proc. of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, 2007, pp. 152-158.
[8] M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill & Webert: identifying interesting web sites", Proc. of the Thirteenth Amer. Nat. Conf. on Artificial Intelligence (AAAI-96), Vol. 1, AAAI Press, Portland, OR, 1996, pp. 54-61.
[9] G. Ou and Y. L. Murphey, "Multi-class pattern classification using neural networks", Pattern Recognition, Vol. 40, No. 1, January 2007, pp. 4-18.
[10] T. Joachims, "A statistical learning model of text classification with support vector machines", Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2001, pp. 128-136.
[11] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation", Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
[12] J. Weston and C. Watkins, "Multi-class support vector machines", Technical Report CSD-TR-9800-04, Department of Computer Science, Royal Holloway, Univ. London, 1998.
[13] J. C. Platt, "Probabilities for Support Vector Machines", Advances in Large Margin Classifiers, MIT Press, 1999, pp. 61-74.
[14] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 1st edition, December 15, 2001.
[15] S. Gao, W. Wu, C.-H. Lee, and T.-S. Chua, "A MFoM Learning Approach to Robust Multiclass Multi-Label Text Categorization", Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada, 2004.
[16] T.-Y. Wang and H.-M. Chiang, "Fuzzy support vector machine for multi-class text categorization", Information Processing and Management, Vol. 43, No. 4, July 2007, pp. 914-929.
[17] P. Zhang and J. Peng, "SVM vs regularized least squares classification", Proceedings of the 17th International Conference on Pattern Recognition, 2004, pp. 176-179.
[18] J. Zhang and Y. Yang, "Robustness of regularized linear classification methods in text categorization", Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 28-August 1, 2003, Toronto, Canada, ACM, 2003, pp. 190-197.
[19] P. Lingras and C. Butz, "Rough set based 1-v-1 and 1-v-r approaches to support vector machine multi-classification", Information Sciences, Vol. 177, No. 18, 2007, pp. 3782-3798.
[20] H. Jin and H. Wu, "Semantic metadata models in references sharing and retrieval system SemreX", Proc. of the 1st Int. Conf. on Grid and Pervasive Computing, May 3-5, 2006, pp. 437-446.
[21] X. M. Ning, H. Jin, and H. Wu, "SemreX: Towards Large-scale Literature Information Retrieval and Browsing with Semantic Association", Proceedings of the 2nd IEEE International Symposium on Service-Oriented Applications, Integration and Collaboration (SOAIC'06), Shanghai, China, 2006, pp. 602-609.
[22] L. Shih, J. D. M. Rennie, Y.-H. Chang, and D. R. Karger, "Text bundling: statistics-based data reduction", Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), August 21-24, 2003, Washington, DC, USA, AAAI Press, 2003, pp. 696-703.
[23] Q. Zhang, L. Zhang, S. Dong, and J. Tan, "Document indexing in text categorization", Proc. of the 4th International Conference on Machine Learning and Cybernetics, Guangzhou, 2005, pp. 392-396.