(IEEE Semantic 2008 Pingpen Yuan) MSVM-KNN Multi-Class Text Classification
sample space to a transform space so as to find a solution for nonlinear problems.
Compared to other categorization methods, SVM has apparent advantages in avoiding over-fitting, slow computation, and poor precision of the result. But it also has weaknesses, mainly in the following aspects. SVM is essentially a binary classifier, while in practice many problems are multi-class categorization problems, so SVM cannot be used directly for multi-class categorization [7]. Besides, improving SVM categorization speed and reducing the number of training samples remain unsolved problems.

3.2 k-Nearest Neighbor

k-NN is an example-based text categorization algorithm. When k-NN is applied to text categorization, it calculates the cosine of the angle between the test text and each sample text in the training set to obtain their similarities, takes the k most similar training texts according to those similarities, and produces a score for each category based on them. Finally, the scores are ranked, and the test text is assigned to the category with the highest score.
k-NN is suitable for small data sets and can achieve good performance on them. However, there is no good general solution for choosing k, so k-NN usually starts from an initial value and then adjusts k according to experimental results.
The selection of the k most similar texts also has a large effect on the categorization results. In practical applications, samples generally have the following characteristics [18]: samples of the same category show a multi-peak distribution, so the distances between samples of the same category are likely to be larger than those between samples of different categories. When the distribution of the samples is not very compact, many samples lie near the category borders and many misclassifications occur, so the algorithm is inaccurate in many cases. Therefore, the k-NN algorithm cannot effectively solve the problem of overlapping category borders.

4. The Multi-class SVM-kNN Approach

In this section, we describe our approach, Multi-class SVM-kNN (MSVM-kNN), which is a combination of SVM and k-NN. In general text categorization, the popular approaches to multi-class SVM are 1-vs-1 and 1-vs-r (rest) [19]. Both of them are time consuming and may leave indivisible cases. To reduce the training time, we adopt an incremental learning method in MSVM-kNN; to solve the indivisible problem, k-NN is adopted to classify the documents that remain indivisible after SVM has classified the rest. Before introducing the incremental learning method and the classification of indivisible documents, we describe the term weight assignment of our approach, which is the basis of the classification.

4.1 Information entropy based term weight assignment

In text categorization, the common document representation scheme is the vector space model, in which the structure of a document and the order of words in the document are ignored. The feature vectors represent the words observed in the documents, and every term in a feature vector is assigned a weight.
The weights of the terms are generally assigned using the standard tf*idf, shown as formula (1):

W_tfidf(t, d) = tf(t, d) × log(N / n_t + 0.01)    (1)

Here tf is the term frequency and idf is the inverse document frequency. Formula (1) is widely used in information retrieval and text categorization. However, the contribution of a term to classification is not reflected in formula (1). In fact, the importance of a term depends not only on its term frequency and document frequency, but also on its contribution to classification. For example, suppose terms t1 and t2 have the same term frequency in document d1 and the same document frequency in the training set, but the documents in which t1 occurs all appear in one category, while the documents in which t2 occurs appear uniformly in every category. Obviously, t1 is more important than t2 for categorization and should have a higher weight; however, if we compute the weights with the standard tfidf function, the weight of t1 is equal to that of t2.
Considering the relationship between terms and categories, we measure the distribution of a term's documents over the categories when weighting terms. This distribution can be weighed by the information entropy H, which is defined as:

H(t_i, d_j) = − Σ_{k=1}^{|C|} (df_ik / dn(t_i)) log(df_ik / dn(t_i))    (2)

where df_ik denotes the document frequency of t_i in category c_k, dn(t_i) denotes the number of documents in which t_i occurs, and |C| denotes the number of categories.
The more uniformly the documents containing t_i are distributed over the categories, the higher the value of H; the less uniformly they are distributed, the lower the value of H. H is highest when, and only when, the probability of the term's documents falling in each category is equal.
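The entropy H of formula (2) can be computed directly from the per-category document frequencies of a term. The sketch below is illustrative (the function name is ours, and the natural logarithm is an assumption, since the paper does not fix the base of the log):

```python
import math

def term_entropy(df_per_category):
    """Entropy H of a term's document distribution over categories,
    as in formula (2): df_per_category[k] is df_ik, the number of
    documents of category k that contain the term; their sum is dn(t)."""
    dn = sum(df_per_category)
    h = 0.0
    for df in df_per_category:
        if df > 0:                 # 0 * log(0) is treated as 0
            p = df / dn
            h -= p * math.log(p)
    return h
```

A term spread evenly over four categories gets the maximal entropy log 4 ≈ 1.386, while a term confined to a single category gets 0; since formula (3) divides by H(t, d), a practical implementation would need to smooth the zero-entropy case.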
Therefore, we propose a method for computing term weights, tfidfh:

W_tfidfh(t, d) = [tf(t, d) × log(N / n_t + 0.01)] / H(t, d)    (3)

This function embodies the intuition that the more frequently a term occurs in a document, the more representative it is of the document's content; the more documents a term occurs in, the less discriminating it is; and the more uniformly the documents containing a term are distributed over the training set, the looser its relationship with the categories. In the foregoing example, when the tfidfh function is used to compute the term weights, t1 receives a higher value than t2.
Text categorization is often characterized by the high dimensionality of the associated feature space and a relatively small number of training samples. One way to reduce the number of features in document categorization is to select a subset of the best terms from the entire feature space [12]. In this paper, we use a weight-based best-n feature selection approach, in which terms are sorted by their weights and the top n terms with the highest weights are selected: we compute the information gain for each unique term and choose the top n as the feature terms of the documents.

4.2 Incremental learning for multi-class SVM

Vapnik [1] has proven that the training result of SVM depends only on a small set of samples, the so-called Support Vectors (SVs), and that the normal of the optimal hyperplane HP can be expressed as a combination of the SVs. Usually, the SV set is only a small portion of the training set. Thereby, if we train the SVM on the SV set instead of the whole training set, the training time can be reduced greatly without much loss of categorization precision. Following this idea, we divide the samples of the SV set into two classes: BSVs (Boundary Support Vectors) and NSVs (Normal Support Vectors). BSVs correspond to the vectors of documents that cannot be classified correctly. These two kinds of SVs together determine the results of the classifier.
Experiments also show that these roles do not always remain the same. When successive training samples are introduced, BSVs, NSVs, and normal samples can transform into one another. For example, due to the limitation of the initial training set, samples that carry the main categorization information may be misclassified as BSVs. As knowledge about the categorization accumulates over the successive incremental training, the contribution of these samples to categorization becomes clear, and these BSVs gradually degrade to NSVs or normal samples.
As we can see above, the equivalence between the SV set and the training set is broken when new samples are introduced into the training set during incremental learning. Thus, on one hand, the original SV set cannot fully describe the categorization properties of the new training set; on the other hand, with the introduction of the new samples, some old samples must be discarded optimally to reduce the storage cost. Therefore, two major problems arise in an incremental SVM learning algorithm: (1) how to construct the new SV set from the initial training set; (2) how to discard old samples optimally.
Traditional SVM algorithms discard all previous training results and re-train the classifier on the whole data set in the case of incremental learning. This method is so slow that it greatly hampers the application of SVM learning in the area of incremental learning. Here we describe a new scheme, which uses an iterative method to find the optimal convergent result on the whole training set. The scheme proceeds as follows. First, the old classifier is tested on the new incremental sample set. The samples classified incorrectly are combined with the current SV set to construct a new training set, and the remaining samples form a new test set. Then a new SVM classifier is trained on the new training set, and the new test set is used to repeat the previous operations. The process continues until all data points are classified correctly.

4.3 k-NN based indivisible cases classification

For multi-class categorization using SVM, we must build several binary classifiers to construct the required multi-class model. These binary classifiers divide the sample space into several zones. Some zones can be identified with a single category, but other zones lie between multiple hyperplanes, and SVM-based classifiers cannot effectively identify samples in these zones.
To solve this problem, we introduce k-NN into the classifier. For an indivisible document, we find its k most similar neighbors among the SVs, and then produce a score for each category based on these similarities. The score of a category is the sum of the similarities of the neighbors belonging to that category, and the test document is assigned to the category with the highest score. In this way, we need not calculate the similarities to all the samples, which reduces the time consumption while solving the indivisible case.
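The k-NN fallback for indivisible documents can be sketched as follows. This is a minimal illustration with hypothetical names: documents are sparse term-to-weight dictionaries, and `support_vectors` stands for the SVs retained by the trained SVM, paired with their category labels:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_indivisible(doc, support_vectors, k=5):
    """Assign an indivisible document to the category whose k nearest
    support vectors have the highest summed similarity."""
    nearest = sorted(((cosine(doc, v), c) for v, c in support_vectors),
                     key=lambda sc: sc[0], reverse=True)[:k]
    scores = defaultdict(float)
    for sim, cat in nearest:
        scores[cat] += sim         # per-category sum of similarities
    return max(scores, key=scores.get)
```

Because only the SVs are scanned rather than the whole training set, the similarity computation stays cheap, which is the saving this fallback aims at.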
5. Implementation and Experimental Results

5.1 Implementation

We have implemented an automatic literature categorization system in SemreX [20, 21]. The goal of the automatic literature categorization (ALC) system is to classify a set of documents into a fixed number of predefined categories. The whole categorization process can be divided into three stages: the text processing stage, the training stage, and the testing stage. In the text processing stage, ALC uses the vector space model to represent the documents (both the training set and the testing set); an unlabeled document must be preprocessed and represented by VSM before it is labeled [22]. In the second stage, ALC learns from the training set and constructs a classifier based on MSVM-kNN. Finally, we use the classifier to classify unlabeled documents. Figure 2 shows the overall flow diagram of ALC.

[Figure 2. Flow Diagram of ALC: preprocessing and feature selection produce the training set, the training algorithm builds the classifier, and the classifier assigns categories to the testing set.]

We use the MSVM-kNN approach as our classifier. In the classifier, our improvements are concentrated on the following points. First, the interior samples are gradually discarded at a certain ratio, which reduces the training set and the storage cost of the old samples. Second, some near-boundary samples are optimally introduced into the training set to accelerate the convergence of the final SV set search procedure.
In order to reduce the training time and other resource requirements of the classifier, the classifier of ALC manages three sample sets: the working set, the cache set, and the backup set. As the incremental training proceeds, the knowledge of the sample distribution accumulates gradually, and the contributions of some samples to the classifier change, so these samples can be moved from one set to another. Because of their different contributions to the final classifier, the characteristics of these three kinds of sample sets can be used to optimize the training procedure. Since the samples in the backup set do not affect the result of the final classifier, they can be discarded gradually, and the remaining samples can be used as the test set to validate the newly trained classifier. The samples in the working set are actively involved in the computation of the final classifier, so they can be introduced directly into the training set of the new classifier to accelerate the training procedure. In general, the contribution of the cache set to the final classifier lies between those of the backup set and the working set; its samples affect the previous categorization result, so they can be selected into the training set of the new classifier according to a certain scheme and priority.

5.2 Experimental results and analysis

For our experiments, we use two data sets: the first is the well-known 20 newsgroups data set; the second consists of literature collected from ACM, which is named the ACM data set here. The 20 newsgroups data set consists of 20 different news groups, each containing approximately 1000 documents; in this paper, we select 19,996 of these documents. The ACM data set contains nine categories with 25 documents per category.
In our experiments, we compare the performance of our approach, MSVM-kNN, with that of SVM and k-NN. Before presenting the experimental results, we introduce the performance measures commonly used to evaluate categorization methods: Precision, Recall, and F-measure.
The performance measures are computed as follows. In one experiment, if we define A as the number of positive samples predicted as positive, B as the number of positive samples predicted as negative, C as the number of negative samples predicted as positive, and D as the number of negative samples predicted as negative, then Precision, Recall, and F-measure can be expressed as:

Precision = A / (A + C)    (4)

Recall = A / (A + B)    (5)

F-measure = 2 × Precision × Recall / (Precision + Recall)    (6)

In the experiments, we first use MSVM-kNN, SVM, and k-NN on the 20 newsgroups data set. Table 1 shows the F-measure of the three text categorization approaches on 20 newsgroups. According to the experimental results, MSVM-kNN performs better than k-NN or SVM in 16 of the 20 categories.
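Formulas (4)-(6) translate directly into code; a small sketch using the same A, B, C counts (D is not needed for these three measures):

```python
def precision_recall_f(A, B, C):
    """Precision, Recall and F-measure from formulas (4)-(6).
    A: positives predicted positive, B: positives predicted negative,
    C: negatives predicted positive."""
    precision = A / (A + C)
    recall = A / (A + B)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 8 correctly recognized positives with 2 missed and 2 false alarms give Precision = Recall = F-measure = 0.8.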
In category sci.space, the F-measure of MSVM-kNN reaches 0.99; however, in categories talk.politics.misc, comp.os.ms-windows.misc, and talk.religion.misc, the F-measures of MSVM-kNN are 0.75, 0.75, and 0.77, the lowest values among the 20 newsgroups. But SVM and k-NN also show their worst performance in these three categories; one of the reasons is that these categories contain miscellaneous documents. In the following four categories: rec.motorcycles, rec.sport.hockey, sci.med, and soc.religion.christian, MSVM-kNN does not show better performance than the other two approaches. However, the F-measure of MSVM-kNN is not far behind: in the three categories other than sci.med, the difference in F-measure between MSVM-kNN and SVM is 0.01.

Table 1. F-measure of MSVM-kNN, SVM and k-NN with 20 newsgroups

Categories | k-NN | SVM | MSVM-kNN
alt.atheism | 0.78 | 0.79 | 0.80
comp.graphics | 0.81 | 0.82 | 0.85
comp.os.ms-windows.misc | 0.65 | 0.64 | 0.75
comp.sys.ibm.pc.hardware | 0.76 | 0.79 | 0.83
comp.sys.mac.hardware | 0.91 | 0.90 | 0.92
comp.windows.x | 0.82 | 0.83 | 0.89
misc.forsale | 0.84 | 0.89 | 0.92
rec.autos | 0.91 | 0.96 | 0.98
rec.motorcycles | 0.87 | 0.98 | 0.97
rec.sport.baseball | 0.86 | 0.96 | 0.98
rec.sport.hockey | 0.92 | 0.98 | 0.97
sci.crypt | 0.89 | 0.93 | 0.93
sci.electronics | 0.82 | 0.93 | 0.94
sci.med | 0.86 | 0.96 | 0.93
sci.space | 0.81 | 0.97 | 0.99
soc.religion.christian | 0.83 | 0.94 | 0.93
talk.politics.guns | 0.86 | 0.94 | 0.94
talk.politics.mideast | 0.83 | 0.88 | 0.91
talk.politics.misc | 0.72 | 0.75 | 0.75
talk.religion.misc | 0.64 | 0.32 | 0.77
Average | 0.82 | 0.86 | 0.90

Figure 3 shows the Precision, Recall, and F-measure of MSVM-kNN, SVM, and k-NN on the 20 newsgroups data set. The Precision and Recall of MSVM-kNN are 0.91 and 0.90, respectively, whereas the Precision and Recall of SVM and k-NN are 0.86 and 0.88, and 0.82 and 0.83, respectively. According to these results, MSVM-kNN achieves better Precision and Recall than either SVM or k-NN alone.

[Figure 3. Performances of MSVM-kNN, SVM and k-NN on 20 newsgroups (Precision, Recall and F-measure).]

Secondly, we compare the performance of the three approaches on the ACM data set, which contains 225 documents. The F-measure comparison of the three approaches on the ACM data set is shown in Table 2. The experimental results show that MSVM-kNN performs better than k-NN or SVM in most of the categories except Data.

Table 2. F-measure of MSVM-kNN, SVM and k-NN with the ACM data set

Categories | k-NN | SVM | MSVM-kNN
Hardware | 0.80 | 1.00 | 1.00
Computer Systems Organization | 0.89 | 0.73 | 0.91
Software | 0.89 | 0.89 | 0.89
Data | 0.73 | 0.89 | 0.83
Theory of Computation | 0.89 | 0.91 | 0.91
Mathematics of Computing | 0.56 | 0.67 | 0.75
Information Systems | 0.73 | 0.73 | 0.89
Computing Methodologies | 0.83 | 0.91 | 0.91
Computer Applications | 0.75 | 0.89 | 0.89
Average | 0.79 | 0.85 | 0.89

Figure 4 shows the Precision, Recall, and F-measure of MSVM-kNN, SVM, and k-NN on the ACM data set. The results also show that MSVM-kNN achieves better Precision and Recall than either SVM or k-NN alone: the Precision and Recall of MSVM-kNN are 0.90 and 0.89, whereas those of SVM and k-NN are 0.86 and 0.84, and 0.82 and 0.78, respectively.
According to the experimental results above, MSVM-kNN achieves better performance in most cases on both the 20 newsgroups and the ACM data sets.
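The "Average" rows in Tables 1 and 2 are consistent with an unweighted arithmetic mean of the per-category F-measures; a one-line sketch under that assumption (the paper does not state its averaging scheme explicitly):

```python
def macro_average(values):
    """Unweighted macro-average, e.g. of per-category F-measures."""
    return sum(values) / len(values)
```

Applied to the MSVM-kNN column of Table 2, it reproduces the reported average of 0.89 after rounding.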
We can see that the Precision, Recall, and F-measure on the two data sets are improved to various degrees. The average F-measure on 20 newsgroups is 0.90, while the best average F-measure of the other two methods is 0.86. The average F-measures of MSVM-kNN, SVM, and k-NN on the ACM data set are 0.89, 0.85, and 0.80. Moreover, the best and worst F-measures of MSVM-kNN on the 20 newsgroups data set are 0.99 and 0.75, while those of SVM and k-NN are 0.98 and 0.32, and 0.92 and 0.64, respectively. Therefore, MSVM-kNN is more stable than the other approaches.

[Figure 4. Performances of MSVM-kNN, SVM and k-NN on the ACM data set (Precision, Recall and F-measure).]

6. Conclusion

Although automatic text categorization has become comparatively mature in some aspects and is now used in many fields, no text categorization approach performs well in all fields, and it is difficult to further improve the performance of a single approach. Therefore, it is necessary to combine two or more text categorization approaches to categorize a data set.
In this paper, we give a brief overview of text categorization and propose a text categorization approach named Multi-Class SVM-kNN (MSVM-kNN), which is a combination of SVM and k-NN. In this approach, term weights are assigned according to information entropy; then SVM is used to classify samples, and k-NN is used to deal with the indivisible cases. The experimental results show that MSVM-kNN performs better than SVM and k-NN in most cases.

References

[1] F. Sebastiani, "Machine Learning in Automated Text Categorization", Technical Report, Centre National de la Recherche Scientifique, 1999.
[2] R. Duwairi, "An eager k-nearest-neighbor classifier for Arabic text categorization", Proceedings of the International Conference on Data Mining (ICDM'05), Nevada, USA, 2005, pp. 187-192.
[3] J. Cervantes, X. Li, W. Yu, and K. Li, "Support vector machine classification for large data sets via minimum enclosing ball clustering", Neurocomputing, Vol. 71, No. 4, 2008, pp. 611-619.
[4] Y. S. Xia and J. Wang, "A one-layer recurrent neural network for support vector machine learning", IEEE Trans. Syst. Man Cybern. B, 2004, pp. 1261-1269.
[5] E. Gabrilovich and S. Markovitch, "Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5", Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada, Morgan Kaufmann, 2004, pp. 321-328.
[6] Y. Yang, J. Zhang, and B. Kisiel, "A scalability analysis of classifiers in text categorization", Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, ACM Press, 2003, pp. 96-103.
[7] Y. Y. Chung, E. H. C. Choi, L. Liu, M. A. Shhukran, D. Y. Shi, and F. Chen, "A New Hybrid Audio Categorization Algorithm Based on SVM Weight Factor and Euclidean Distance", Proc. of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, 2007, pp. 152-158.
[8] M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill & Webert: identifying interesting web sites", Proc. of the Thirteenth Amer. Nat. Conf. on Artificial Intelligence (AAAI-96), Vol. 1, AAAI Press, Portland, OR, 1996, pp. 54-61.
[9] G. Ou and Y. L. Murphey, "Multi-class pattern classification using neural networks", Pattern Recognition, Vol. 40, No. 1, January 2007, pp. 4-18.
[10] T. Joachims, "A statistical learning model of text classification with support vector machines", Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2001, pp. 128-136.
[11] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation", Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
[12] J. Weston and C. Watkins, "Multi-class support vector machines", Technical Report CSD-TR-9800-04, Department of Computer Science, Royal Holloway, Univ. London, 1998.
[13] J. C. Platt, "Probabilities for Support Vector Machines", Advances in Large Margin Classifiers, MIT Press, 1999, pp. 61-74.
[14] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 1st edition, December 15, 2001.
[15] S. Gao, W. Wu, C.-H. Lee, and T.-S. Chua, "A MFoM Learning Approach to Robust Multiclass Multi-Label Text Categorization", Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada, 2004.
[16] T.-Y. Wang and H.-M. Chiang, "Fuzzy support vector machine for multi-class text categorization", Information Processing and Management, Vol. 43, No. 4, July 2007, pp. 914-929.
[17] P. Zhang and J. Peng, "SVM vs regularized least squares classification", Proceedings of the 17th International Conference on Pattern Recognition, 2004, pp. 176-179.
[18] J. Zhang and Y. Yang, "Robustness of regularized linear classification methods in text categorization", Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 28-August 1, 2003, Toronto, Canada, ACM, 2003, pp. 190-197.
[19] P. Lingras and C. Butz, "Rough set based 1-v-1 and 1-v-r approaches to support vector machine multi-classification", Information Sciences, Vol. 177, No. 18, 2007, pp. 3782-3798.
[20] H. Jin and H. Wu, "Semantic metadata models in references sharing and retrieval system SemreX", Proc. of the 1st Int. Conf. on Grid and Pervasive Computing, May 3-5, 2006, pp. 437-446.
[21] X. M. Ning, H. Jin, and H. Wu, "SemreX: Towards Large-scale Literature Information Retrieval and Browsing with Semantic Association", Proceedings of the 2nd IEEE International Symposium on Service-Oriented Applications, Integration and Collaboration (SOAIC'06), Shanghai, China, 2006, pp. 602-609.
[22] L. Shih, J. D. M. Rennie, Y.-H. Chang, and D. R. Karger, "Text bundling: statistics-based data reduction", Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), August 21-24, 2003, Washington, DC, USA, AAAI Press, 2003, pp. 696-703.
[23] Q. Zhang, L. Zhang, S. Dong, and J. Tan, "Document indexing in text categorization", Proc. of the 4th International Conference on Machine Learning and Cybernetics, Guangzhou, 2005, pp. 392-396.