Abstract
Support vector machines (SVMs) have been recognized as one of the most successful classifica-
tion methods for many applications including text classification. Even though the learning ability
and computational complexity of training in support vector machines may be independent of the
dimension of the feature space, reducing computational complexity is an essential issue to effi-
ciently handle a large number of terms in practical applications of text classification. In this paper,
we adopt novel dimension reduction methods to reduce the dimension of the document vectors
dramatically. We also introduce decision functions for the centroid-based classification algorithm
and support vector classifiers to handle the classification problem where a document may belong to
multiple classes. Our substantial experimental results show that with several dimension reduction
methods that are designed particularly for clustered data, higher efficiency for both training and
testing can be achieved without sacrificing prediction accuracy of text classification even when the
dimension of the input space is significantly reduced.
Keywords: dimension reduction, support vector machines, text classification, linear discriminant
analysis, centroids
1. Introduction
Text classification is a supervised learning task for assigning text documents to pre-defined classes
of documents. It is used to find valuable information from a huge collection of text documents
available in digital libraries, knowledge databases, the world wide web (WWW), and company-wide
intranets, to name a few. Several characteristics have been observed in vector space based methods
for text classification (20; 21), including the high dimensionality of the input space, sparsity of
document vectors, linear separability in most text classification problems, and the belief that few
features are irrelevant. It has been conjectured that an aggressive dimension reduction may result in
a significant loss of information, and therefore, result in poor classification results (13).
Assume that training data (xi , yi ) with yi ∈ {−1, +1} for 1 ≤ i ≤ n are given. The dual formula-
tion of soft margin support vector machines (SVMs) with a kernel function K and control parameter
C is
max_{α_i}   ∑_{i=1}^{n} α_i − (1/2) ∑_{i,j=1}^{n} α_i α_j y_i y_j K(x_i, x_j),    (1)
s.t.   ∑_{i=1}^{n} α_i y_i = 0,   0 ≤ α_i ≤ C,   i = 1, . . . , n.
A commonly used kernel function is the Gaussian RBF (radial basis function) kernel
K(x, x_i) = exp(−γ‖x − x_i‖^2),
where γ is a parameter that controls the width of the Gaussian function. The evaluation of the kernel function depends on the dimension of
the input data, since the kernel functions contain the inner product of two input vectors for the linear
or polynomial kernels or the distance of two vectors for the Gaussian RBF kernel. Let α∗i denote
the optimal solution for (1). The optimal separating hyperplane f (x, α∗ , b) also requires evaluation
of the kernel function since
f(x, α^*, b) = ∑_{x_i ∈ SV} α^*_i y_i K(x_i, x) + b,
where SV denotes the set of support vectors.
Therefore, more efficient testing as well as training is expected from dimension reduction.
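To make this dependence on the input dimension concrete, the following sketch evaluates the RBF decision function above. It is a minimal illustration with numpy and hypothetical helper names; the multipliers α_i, labels y_i, and bias b are assumed to come from solving (1) with some SVM solver. Every kernel evaluation costs O(m) in the input dimension m, so shorter vectors directly reduce both training and testing time.

import numpy as np

def rbf_kernel(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2); each evaluation is O(m) in the input dimension m.
    d = x - z
    return np.exp(-gamma * d.dot(d))

def svm_decision(x, support_vectors, alpha, y, b, gamma):
    # f(x, alpha, b) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b
    return sum(a * yi * rbf_kernel(xi, x, gamma)
               for xi, a, yi in zip(support_vectors, alpha, y)) + b

# Toy comparison: the same decision function with full and reduced dimensional vectors.
rng = np.random.default_rng(0)
n_sv = 100
for m in (22095, 5):                      # full term space vs. a centroid-reduced space
    SV = rng.standard_normal((n_sv, m))
    alpha, y = rng.random(n_sv), rng.choice([-1.0, 1.0], size=n_sv)
    x = rng.standard_normal(m)
    print(m, svm_decision(x, SV, alpha, y, b=0.0, gamma=1.0 / m))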
Throughout the paper, we will assume that the document set is represented in an m × n term-
document matrix A = (ai j ), in which each column represents a document, and each entry ai j repre-
sents the weighted frequency of term i in document j (1; 2). The clustering of data is assumed to be
performed previously.
In the next section, we review Latent Semantic Indexing (LSI) (2; 1), which uses the truncated
singular value decomposition (SVD) as a low-rank approximation of A. Although the truncated SVD
provides the closest approximation to A in Frobenius or L2 norm, LSI ignores the cluster structure
while reducing the dimension of the data. In contrast, in Section 3, we review several dimension
reduction methods that are especially effective for classification of clustered data: two methods
based on centroids (16; 12), and one method which is a generalization of linear discriminant analysis
(LDA) using the generalized singular value decomposition (GSVD) (10). With dimension reduction,
computational complexity can be dramatically reduced for all classifiers including support vector
machines and k-nearest neighbor classification. For k-nearest neighbor classification (kNN), the
distances of vector pairs need to be computed when finding k nearest neighbors. Therefore, one can
significantly reduce computational complexity by dimension reduction.
In many document data sets, documents can be assigned to more than one cluster upon clas-
sification. To handle this problem more effectively, we introduce a threshold based extension of
several classification algorithms in Section 4. Our numerical experiments illustrate that the cluster-
preserving dimension reduction algorithms we employ reduce the data dimension without any sig-
nificant loss of information. In fact, in many cases, they seem to have the effect of noise reduction,
since prediction accuracy becomes better after dimension reduction when compared to that in the
original high dimensional input space.
The rank-l truncated SVD gives the approximation A ≈ U_l Σ_l V_l^T, where the columns of U_l are the leading l left singular vectors, Σ_l is an l × l diagonal matrix with the l largest singular values in nonincreasing order along its diagonal, and the columns of V_l are the leading l right singular vectors. Then Σ_l V_l^T is the reduced dimensional representation of A, or equivalently, a new document q ∈ R^{m×1} can be represented in the l-dimensional space as q̂ = U_l^T q.
This low-rank approximation has been widely applied in information retrieval (2). Since complete orthogonal decompositions such as ULV or URV have computational advantages over the
SVD including easier updating (22; 23; 24) and downdating (17), dimension reduction by these
faster low-rank orthogonal decompositions has also been exploited (3). However, LSI ignores the
cluster structure while reducing the dimension. In addition, since there is no theoretical optimum
value for the reduced dimension, potentially expensive experimentation may be required to deter-
mine a reduced dimension l. As we report in Section 5, classification results after LSI vary de-
pending upon the reduced dimension, classification method, and similarity measure employed. The
experimental results confirm that when the data set is already clustered, the dimension reduction
methods we present in the next section are more effective for classification of new data.
The dimension reduction methods reviewed in this section are based on a low rank approximation of the term-document matrix A,
A ≈ BY,    (2)
where B ∈ Rm×l with rank(B) = l and Y ∈ Rl×n with rank(Y ) = l. The matrix B accounts for the
dimension reducing transformation. However, it is not necessary to compute the dimension reducing
transformation G from B explicitly, as long as we can find the reduced dimensional representation
of a given data item. If the matrix B is already determined, the matrix Y can be computed by solving the least squares problem
min_{Y ∈ R^{l×n}} ‖BY − A‖_F.    (3)
Any given document q ∈ Rm×1 can be transformed to the lower dimensional space by solving the
minimization problem
min_{q̂ ∈ R^{l×1}} ‖Bq̂ − q‖_2.    (4)
Latent Semantic Indexing that utilizes the SVD (LSI/SVD) can be viewed as a variation of the
model (2) with B = Ul (16), where Ul Σl VlT is the rank l truncated SVD of A. Then q̂ = UlT q is
obtained by solving the least squares problem
min_{q̂ ∈ R^{l×1}} ‖Bq̂ − q‖_2 = min_{q̂ ∈ R^{l×1}} ‖U_l q̂ − q‖_2.    (5)
In the Centroid dimension reduction algorithm (see Algorithm 1), the ith column of B is the
centroid vector of the ith cluster, which is the average of the data items in the ith cluster, for 1 ≤ i ≤ p.
This matrix B is called the centroid matrix. Then, any vector q ∈ Rm×1 can be represented in the
p dimensional space as q̂, the solution of the least squares problem (4), where B is the centroid
matrix. In the Orthogonal Centroid algorithm (see Algorithm 2), the p dimensional representation
of a data vector q ∈ Rm×1 is given as q̂ = QTp q where Q p is an orthonormal basis for the centroid
matrix obtained from its QR decomposition.
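The two centroid-based reductions can be sketched as follows. This is a minimal illustration with numpy and assumed function names, not the paper's implementation: the Centroid method solves the least squares problem (4) with the centroid matrix B, while the Orthogonal Centroid method projects onto an orthonormal basis Q_p of B obtained from its QR decomposition.

import numpy as np

def centroid_matrix(A, labels, p):
    # Column i of B is the centroid (mean) of the documents in cluster i.
    return np.column_stack([A[:, labels == i].mean(axis=1) for i in range(p)])

def centroid_reduce(B, q):
    # Centroid method: q_hat solves the least squares problem min ||B q_hat - q||_2.
    q_hat, *_ = np.linalg.lstsq(B, q, rcond=None)
    return q_hat

def orthogonal_centroid_reduce(B, q):
    # Orthogonal Centroid method: q_hat = Q_p^T q, where B = Q_p R is the reduced QR.
    Q_p, _ = np.linalg.qr(B)
    return Q_p.T @ q

rng = np.random.default_rng(0)
A = rng.random((100, 30))                 # toy 100-term, 30-document matrix
labels = rng.integers(0, 3, size=30)      # cluster labels assumed given
B = centroid_matrix(A, labels, p=3)
q_hat_c = centroid_reduce(B, A[:, 0])
q_hat_oc = orthogonal_centroid_reduce(B, A[:, 0])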
The centroid-based dimension reduction algorithms are computationally less costly than LSI/SVD.
They are also more effective when the data are already clustered. Although the centroid-based
schemes can be applied only when the data are linearly separable, they are suitable for text classifi-
cation problems, since text data is usually linearly separable in the original dimensional space (13).
For a nonlinear extension of the Orthogonal Centroid method that utilizes kernel functions, see (18).
Algorithm 3 LDA/GSVD
Given a data matrix A ∈ Rm×n with p clusters, this algorithm computes the columns of the matrix
G ∈ Rm×(p−1) , which preserves the cluster structure in the reduced dimensional space, and it also
computes the p − 1 dimensional representation Y of A.
1. Compute Hb ∈ Rm×p and Hw ∈ Rm×n from A according to Eqns. (7) and (6), respectively.
3. Let t = rank(H).
X = Q [ R^{-1}W   0
        0         I ],
6. Y = GT A
Since
trace(S_w) = ∑_{i=1}^{p} ∑_{j∈N_i} ‖a_j − c_i‖_2^2
measures the closeness within the clusters, and
trace(S_b) = ∑_{i=1}^{p} ∑_{j∈N_i} ‖c_i − c‖_2^2
measures the remoteness between the clusters, the goal is to minimize the former while maximizing
the latter in the reduced dimensional space. Once again letting GT ∈ Rl×m denote the transformation
that maps a column of A in the m dimensional space to a vector in the l dimensional space, the
goal can be expressed as the simultaneous minimization of trace(GT Sw G) and maximization of
trace(GT Sb G).
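For intuition, here is a small sketch of the classical solution of this criterion, using numpy and scipy with hypothetical function names and the standard scatter-matrix definitions (given below via H_w and H_b). It is not the paper's LDA/GSVD algorithm and applies only when S_w is nonsingular, as discussed next.

import numpy as np
from scipy.linalg import eigh

def classical_lda(A, labels, p, l):
    # Columns of G maximize trace((G^T S_w G)^{-1} (G^T S_b G)); requires S_w nonsingular.
    c = A.mean(axis=1, keepdims=True)
    m = A.shape[0]
    S_w, S_b = np.zeros((m, m)), np.zeros((m, m))
    for i in range(p):
        A_i = A[:, labels == i]
        c_i = A_i.mean(axis=1, keepdims=True)
        S_w += (A_i - c_i) @ (A_i - c_i).T
        S_b += A_i.shape[1] * (c_i - c) @ (c_i - c).T
    # Generalized symmetric eigenproblem S_b x = lambda S_w x; eigh returns the
    # eigenvalues in ascending order, so keep the eigenvectors of the l largest.
    _, V = eigh(S_b, S_w)
    G = V[:, -l:]
    return G, G.T @ A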
When Sw is nonsingular, this simultaneous optimization is commonly approximated by maxi-
mizing
J_1(G) = trace((G^T S_w G)^{-1} (G^T S_b G)).
It is well known that the global maximum is achieved when the columns of G are the eigenvectors
of S_w^{-1} S_b that correspond to the l largest eigenvalues (7; 25). In fact, when the reduced dimension
l ≥ p − 1, trace(S_w^{-1} S_b) is exactly preserved upon dimension reduction, and equals λ_1 + · · · + λ_{p−1},
where each λi ≥ 0. Without loss of generality, we assume that the term-document matrix A is parti-
tioned as
A = [A1 , · · · , A p ]
where the columns of each block A_i ∈ R^{m×n_i} belong to the cluster i. Letting e^{(i)} = (1, . . . , 1)^T ∈ R^{n_i×1} and defining the matrices
H_w = [A_1 − c_1 e^{(1)T}, . . . , A_p − c_p e^{(p)T}] ∈ R^{m×n}    (6)
and
H_b = [√n_1 (c_1 − c), . . . , √n_p (c_p − c)] ∈ R^{m×p},    (7)
then
S_w = H_w H_w^T   and   S_b = H_b H_b^T.
As the product of an m × n matrix with an n × m matrix, Sw will be singular when the number of
terms m exceeds the number of documents n. In that case, classical discriminant analysis fails.
However, if we rewrite the eigenvalue problem S_w^{-1} S_b x_i = λ_i x_i as
where the columns of Q_H Z ∈ R^{m×(p+n)} are orthonormal. There exists an orthogonal Q ∈ R^{m×m} whose
first p + n columns are the columns of Q_H Z. Hence
H = P [ Σ_H   0
        0     0 ] Q^T,
where there are now m − t zero columns to the right of ΣH . Since RH ∈ R(p+n)×(p+n) is a much
smaller matrix than H ∈ R(p+n)×m , the required memory is substantially reduced. In addition, the
computational complexity of the algorithm is reduced to O(mn^2) + O(n^3) (8), since this step is the
dominating part.
4. Classification Methods
To test the effect of dimension reduction in text classification, three different classification methods
were used: centroid-based classification, k-nearest neighbor (kNN), and support vector machines
(SVMs). Each classification method is modified by introducing some threshold values to perform
classification correctly when a document has membership in multiple classes. In this section, we
briefly review the three classification methods and discuss their modifications.
• find the index j such that sim(q, c_i), 1 ≤ i ≤ p, is minimum (or maximum), where sim(q, c_i)
is the similarity measure between q and c_i. (For example, sim(q, c_i) = ‖q − c_i‖_2 using the L2
norm, and we take the index with the minimum value. Using the cosine measure,
sim(q, c_i) = cos(q, c_i) = q^T c_i / (‖q‖_2 ‖c_i‖_2),
we take the index with the maximum value.) With the cosine similarity measure, the decision rule can be written as
arg max_{1≤i≤p}  q^T c_i / (‖q‖_2 ‖c_i‖_2),    (8)
where ci is the centroid of the ith cluster of the training data. When dimension reduction is per-
formed by the Centroid algorithm, the centroids of the full space become the columns ei ∈ R p×1 of
the identity matrix. Then the decision rule becomes
arg max_{1≤i≤p}  q̂^T e_i / (‖q̂‖_2 ‖e_i‖_2),    (9)
where q̂ is the reduced dimensional representation of the document q. This shows that classification
can be performed by simply finding the index i of the vector q̂ with the largest component. Centroid-
based classification has the advantage that the computation involved is extremely simple. We can
also classify using the L2 norm similarity measure by finding the centroid that is closest to q in L2
norm.
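In code, rule (9) amounts to a single argmax over the components of q̂, since dividing by the norms does not change the maximizer; a trivial sketch, assuming numpy:

import numpy as np

def centroid_classify_reduced(q_hat):
    # Rule (9): the e_i are unit vectors, so the cosine argmax is just the largest component.
    return int(np.argmax(q_hat))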
The original form of centroid-based classification finds the nearest centroid and assigns the
corresponding class as the predicted class. To allow an assignment of any document to multiple
classes, we introduce the decision rule for centroid-based classification as
y(x, j) = sign( sim(x, c_j) − θ_j^c ),    (10)
where y(x, j) ∈ {+1, −1} is the classification for document x with respect to class j (if y > 0 then
the class is j, else the class is not j), sim(x, c j ) is the similarity between the test document x and the
centroid vector c_j for the class j, and θ_j^c is the class specific threshold for the binary decision for
y(x, j) in centroid-based classification. In this way, document x will be a member of class j if its
similarity to the centroid vector c j for the class is above the threshold.
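A minimal sketch of the thresholded rule (10) follows; the names are hypothetical, numpy is assumed, and the thresholds θ_j^c would be tuned by numerical experiments as in Section 5.

import numpy as np

def cosine_sim(x, c):
    return (x @ c) / (np.linalg.norm(x) * np.linalg.norm(c))

def centroid_multilabel(x, centroids, thresholds):
    # Assign x to every class j with sim(x, c_j) - theta_j^c > 0, i.e. y(x, j) = +1.
    return [j for j, (c_j, theta_j) in enumerate(zip(centroids, thresholds))
            if cosine_sim(x, c_j) > theta_j]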
1. Using the similarity measure sim(q, a_j) for 1 ≤ j ≤ n, find the k nearest neighbors of q.
2. Among these k neighbors, count the number that belong to each cluster.
3. Assign q to the cluster with the greatest count in the previous step.
To allow multiple class membership, the decision rule for kNN classification is
y(x, j) = sign( ∑_{d_i ∈ kNN} sim(x, d_i) y(d_i, j) − θ_j^kNN ),    (11)
where kNN is the set of k nearest neighbors for document x, y(d_i, j) ∈ {+1, −1} is the classification
for document d_i with respect to class j (if y > 0 then the class is j, else the class is not j), sim(x, d_i)
is the similarity between the test document x and the training document d_i, and θ_j^kNN is the class
specific threshold for kNN classification.
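A corresponding sketch for the thresholded kNN rule (11), assuming cosine similarity, dense numpy arrays, and hypothetical names:

import numpy as np

def knn_multilabel(x, docs, doc_labels, k, thresholds):
    # docs: n x m matrix of training documents (rows); doc_labels[i]: set of classes of doc i.
    sims = (docs @ x) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(x))
    nn = np.argsort(sims)[-k:]            # indices of the k most similar training documents
    assigned = []
    for j, theta_j in enumerate(thresholds):
        vote = sum(sims[i] * (1.0 if j in doc_labels[i] else -1.0) for i in nn)
        if vote > theta_j:                # y(x, j) = sign(vote - theta_j^kNN)
            assigned.append(j)
    return assigned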
Similarly, the decision rule for the support vector classifier is
y(x, j) = sign( ∑_{x_i ∈ SV} α_i y_i K(x, x_i) + b − θ_j^SVM ),    (12)
where y(x, j) ∈ {+1, −1} is the classification for document x with respect to class j, SV is the set
of support vectors, and θ_j^SVM is the class specific threshold for the binary decision. This threshold is
set so that a new document x is not classified as belonging to class j when it is located very close
to the optimal separating hyperplane, i.e., when the decision is made with low reliability. We use
the linear kernel K = ⟨x, x_i⟩, the polynomial kernel K = [⟨x, x_i⟩ + 1]^d, where d is the degree of
the polynomial, and the Gaussian RBF (radial basis function) kernel K = exp(−γ‖x − x_i‖^2), where
γ is a parameter that controls the width of the Gaussian function.
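For concreteness, here is a sketch of per-class binary SVMs with class specific thresholds. It uses scikit-learn's SVC as a stand-in solver, which is an assumption on our part; the paper does not specify this implementation.

import numpy as np
from sklearn.svm import SVC

def train_binary_svms(X, Y, C=1.0, gamma=0.1):
    # One binary SVM per class j; Y[:, j] in {+1, -1} encodes membership in class j.
    return [SVC(kernel="rbf", C=C, gamma=gamma).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def svm_multilabel(models, x, thresholds):
    # Assign x to class j when the signed distance to the separating hyperplane
    # exceeds the class specific threshold theta_j^SVM.
    x = np.asarray(x).reshape(1, -1)
    return [j for j, (model, theta_j) in enumerate(zip(models, thresholds))
            if model.decision_function(x)[0] > theta_j]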
5. Experimental Results
Prediction results are compared for the test documents in the full space without any dimension re-
duction as well as those in the reduced space obtained by LSI/SVD, Centroid, Orthogonal Centroid,
and LDA/GSVD dimension reduction methods. For SVMs, we optimized the regularization param-
eter C, polynomial degree d for the polynomial kernel, and γ for the Gaussian RBF (radial basis
function) kernel for each full and reduced dimension data set.
Table 1: Text classification accuracy (%) using centroid-based classification, k-nearest neighbor
classification, and SVMs, with LSI/SVD dimension reduction on the MEDLINE data set.
The Euclidean norm (L2 ) and the cosine similarity measure (Cosine) were used for the
centroid-based and kNN classification.
The first data set that we used was a subset of the MEDLINE database with 5 classes. Each class
has 500 documents. The set was divided into 1250 training documents and 1250 test documents.
After stemming and stoplist removal, the training set contains 22095 distinct terms. For this data,
each document belongs to only one class, and we used the original form of the three classification
algorithms without introducing the threshold.
The second data set was the “ModApte” split of the Reuters-21578 text collection. We used only the
90 classes for which there is at least one training and one test example in each class. It contains
7769 training documents and 3019 test documents. The training set contains 11941 distinct terms
after preprocessing with stoplist removal and stemming. The Reuters data set contains documents
that belong to multiple classes, so the classification methods utilize thresholds.
We used a standard weight factor for each word stem:
φ_i(x) = (tf_i · log(idf_i)) / κ,    (13)
where tf_i is the number of occurrences of term i in document x, idf_i = n/d is the ratio between
the total number of documents n and the number of documents d containing the term, and κ is the
normalization constant that makes ‖φ(x)‖_2 = 1.
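A small sketch of this weighting for one document, with hypothetical names and numpy assumed:

import numpy as np

def tfidf_weights(tf, df, n_docs):
    # phi_i(x) = tf_i * log(idf_i) / kappa, idf_i = n/d_i, with kappa normalizing ||phi(x)||_2 to 1.
    tf = np.asarray(tf, dtype=float)
    idf = n_docs / np.asarray(df, dtype=float)
    phi = tf * np.log(idf)
    kappa = np.linalg.norm(phi)
    return phi / kappa if kappa > 0 else phi

# Term frequencies of one document and document frequencies of the same terms.
print(tfidf_weights(tf=[3, 0, 1, 2], df=[10, 50, 5, 25], n_docs=100))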
Table 1 reports text classification accuracy for the MEDLINE data set using LSI/SVD with a
range of values for the reduced dimension. The smallest reduced dimension, l = 5, is included in
order to compare with centroid-based and LDA/GSVD methods, which reduce the dimension to 5
and 4, respectively. Since the training set has the nearly-full rank of 1246, we include the reduced
dimensions 1246 and 1247 at the high end of the range. For a training set of size 1250, the reduced
dimension l = 300 is generous. However, we observe that kNN classification with L2 norm simi-
larity produces poor classification results for l values from 100 to 500. This is consistent with the
common belief that cosine similarity performs better with unnormalized text data. Also, classifica-
tion accuracy using 5NN lags that for higher values of k, suggesting that k=5 is too small for classes
Table 2: Text classification accuracy (%) with different kernels in SVMs with and without dimen-
sion reduction on the MEDLINE data set. The regularization parameter C for each case
was optimized by numerical experiments. Dimension of each training term-document ma-
trix is shown. LDA/GSVD4 and LDA/GSVD5 represent the results from LDA/GSVD
where the reduced dimensions are 4 and 5, respectively.
of size 250. It is noteworthy that even with LSI, which makes no attempt to preserve the cluster
structure upon dimension reduction, SVM classification achieves very consistent classification re-
sults for reduced dimensions of 100 or greater, and the SVM accuracy exceeds that of the other
classification methods.
Table 2 shows text classification accuracy (%) with different kernels in SVMs, with and without
dimension reduction on the MEDLINE data set. Note that the linearopt values are optimal over all
the values of the regularization parameter C that we tried, and the RBFopt values are optimal over
all the γ values we tried. This table shows that the prediction results in the reduced dimension are
similar to those in the original full dimensional space, while achieving a significant reduction in
time and space complexity. In the reduced space obtained by the Orthogonal Centroid dimension
reduction algorithm, the classification accuracy is insensitive to the choice of the kernel. Thus, we
can choose the linear kernel in this case instead of the computationally more expensive polynomial
or RBF kernel.
Table 3 shows classification accuracy obtained by all three classification methods – centroid-
based, kNN with three different values of k, and the optimal result from SVM – for each dimension
reduced data set and the full space. For the LDA/GSVD dimension reduction method, the classi-
fication accuracy with cosine similarity measure is lower with centroid-based classification as well
as with kNN, while the results with L2 norm are better. This is due to the formulation of trace
optimization criteria in terms of the L2 norm. With LDA/GSVD, documents from the same class in
Table 3: Text classification accuracy (%) using centroid-based classification, k-nearest neighbor
classification, and SVMs, with and without dimension reduction on the MEDLINE data
set. The Euclidean norm (L2 ) and the cosine similarity measure (Cosine) were used for
centroid-based and kNN classification.
Table 4: Text classification accuracy (%) of the 5 classes and the microaveraged performance over
all 5 classes on the MEDLINE data set. All results are from SVMs using optimal kernels.
the full dimensional space tend to be transformed to a very tight cluster or even to a single point in
the reduced space, since the LDA/GSVD algorithm tends to minimize the trace of the within cluster
scatter. This seems to make it difficult for SVMs to find a binary classifier with low generalization
error.
Table 4 shows text classification accuracy for the 5 classes using SVMs with and without dimen-
sion reduction methods on the MEDLINE data set. The colon cancer and oral cancer documents
were relatively hard to classify correctly.
The REUTERS data set has many documents that are classified to more than 2 classes, whereas
no document is classified to belong to more than one class in the MEDLINE data set. While we
Table 5: Comparison of micro-averaged F1 scores for 3 different classification methods with and
without dimension reduction on the REUTERS data set. The Euclidean norm (L2 ) and the
cosine similarity measure (Cosine) were used for the centroid-based classification. The
cosine similarity measure was used for the kNN classification. The dimension of the full
training term-document matrix is 11941×9579 and that of the reduced matrix is 90×9579.
could handle relatively large matrices using a sparse matrix representation and sparse QR decom-
position in the Centroid and Orthogonal Centroid dimension reduction methods, results for the
LDA/GSVD dimension reduction method are not reported, since we ran out of memory while com-
puting the GSVD. For this data set, we built a series of threshold-based classifiers, optimizing the
thresholds to capture the multiple class membership. All class specific thresholds (θ_j^kNN, θ_j^c, θ_j^SVM)
are determined by numerical experiments. Though we obtained precision/recall break even points
by optimizing the thresholds, we report values of the F1 measure (26) which is defined as
F1 = 2rp / (r + p),    (14)
where r is recall and p is precision for a binary classification. Table 5 shows that the effectiveness
of classification was preserved for the Orthogonal Centroid dimension reduction algorithm, while it
became worse for the Centroid dimension reduction algorithm. This is due to a property of the Cen-
troid algorithm that the centroids of the full space are projected to the columns of the identity matrix
in the reduced space. This orthogonality between the centroids may make it difficult to represent the
multiclass membership of a document by separating closely related classes after dimension reduc-
tion. The pattern of prediction measure F1 for each class is also preserved by Orthogonal Centroid
in Table 6. The macro-averaged F1 and micro-averaged F1 for the 10 most frequent classes are also
presented.
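For reference, micro- and macro-averaging of the F1 measure (14) can be computed as in the following sketch (assumed names; the per-class counts are true positives, false positives, and false negatives):

def f1_score(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0      # precision
    r = tp / (tp + fn) if tp + fn else 0.0      # recall
    return 2 * r * p / (r + p) if r + p else 0.0

def micro_macro_f1(per_class_counts):
    # per_class_counts: list of (tp, fp, fn) tuples, one tuple per class.
    macro = sum(f1_score(*c) for c in per_class_counts) / len(per_class_counts)
    tp, fp, fn = (sum(c[i] for c in per_class_counts) for i in range(3))
    micro = f1_score(tp, fp, fn)
    return micro, macro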
Table 6: F1 scores of the 10 most frequent classes and micro-averaged performance over all 90
classes on the REUTERS data set. All results are from SVMs using optimal kernels.
The dimension of the full training term-document matrix is 11941×9579 and that of the
reduced matrix is 90×9579.
We have evaluated the dimension reduction methods with three classifiers:
SVMs, kNN, and centroid-based classification. For the three cluster-preserving methods, the re-
sults show surprisingly high prediction accuracy, which is essentially the same as in the original
full space, even with very dramatic dimension reduction. They justify dimension reduction as a
worthwhile preprocessing stage for achieving high efficiency and effectiveness. Especially for kNN
classification, the savings in computational complexity in classification after dimension reduction
are significant. In the case of SVMs the savings are also clear, since distances or inner products between
pairs of input data points need to be computed repeatedly, whether or not a kernel function is used,
and the vectors become significantly shorter with dimension reduction.
We have also introduced threshold based classifiers for centroid-based classification and SVMs
in order to capture the overlap structure between closely related classes. Prediction results with the
Centroid dimension reduction method became better compared to those from the full space for the
completely disjoint MEDLINE data set, but became worse for the REUTERS data set. Since the
Centroid dimension reduction method maps the centroids to unit vectors ei which are orthogonal
to each other, it is helpful for the disjoint data set, but not for a data set which contains documents
belonging to multiple classes. We observed that prediction accuracy with the Orthogonal Centroid di-
mension reduction algorithm was preserved for SVMs as well as with centroid-based classification.
The Orthogonal Centroid dimension reduction method maximizes the between cluster relationship
using the relatively inexpensive reduced QR decomposition, compared to LDA/GSVD which also
considers the within cluster relationship but requires a more expensive rank revealing decomposition
such as the singular value decomposition (10; 11).
The better prediction accuracy of SVMs is due to the low generalization error achieved by maximizing
the margin, and to their capability to handle non-linearity through the choice of kernel. Although most classes of
the Reuters-21578 data set are linearly separable (13), there seems to be some level of non-linearity.
For non-linearly separable data, SVMs with appropriate nonlinear kernel functions would work as a
better classifier. Another way to handle non-linearly separable data is to apply nonlinear extensions
of the dimension reduction methods, including those presented in (18; 19). All of the dimension
reduction methods presented here can also be applied to visualize the higher dimensional structure
by reducing the dimension to 2- or 3-dimensional space.
We conclude that dramatic dimension reduction of text documents can be achieved, without
sacrificing classification accuracy. For the document sets we tested, the Orthogonal Centroid method
did particularly well at preserving the cluster structure from the full dimensional representation.
That is, the prediction accuracies for Orthogonal Centroid rival those of the full space, even though
the dimension is reduced to the number of clusters. The savings in computational complexity are
significant using either kNN classification or SVM.
Acknowledgments
This material is based upon work supported by the National Science Foundation Grant No. CCR-
0204109. Any opinions, findings and conclusions or recommendations expressed in this material
are those of the authors and do not necessarily reflect the views of the National Science Foundation
(NSF). The authors would also like to thank the University of Minnesota Supercomputing Institute
(MSI) for providing the computing facilities.
References
[1] M. W. Berry, Z. Drmac, and E. R. Jessup. Matrices, vector spaces, and information retrieval.
SIAM Review, 41:335–362, 1999.
[2] M. W. Berry, S. T. Dumais, and G. W. O’Brien. Using linear algebra for intelligent information
retrieval. SIAM Review, 37:573–595, 1995.
[3] M. W. Berry and R. D. Fierro. Low-rank orthogonal decompositions for information retrieval
applications. Numerical Linear Algebra with Applications, 3(4):301–327, 1996.
[4] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, PA, 1996.
[5] N. Cristianini and J. Shawe-Taylor. Support Vector Machines and Other Kernel-based Learn-
ing Methods. Cambridge University Press, 2000.
[6] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent
semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second ed., Academic Press,
1990.
[8] G. H. Golub and C. F. Van Loan. Matrix Computations, third edition. Johns Hopkins Univer-
sity Press, Baltimore, 1996.
[9] M. Heiler. Optimization Criteria and Learning Algorithms for Large Margin Classifiers.
Diploma Thesis, University of Mannheim., 2002.
[10] P. Howland, M. Jeon, and H. Park. Structure Preserving Dimension Reduction for Clustered
Text Data based on the Generalized Singular Value Decomposition. SIAM Journal of Matrix
Analysis and Applications, 25(1):165–179, 2003.
[11] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singular
value decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):
995-1006, 2004.
[12] M. Jeon, H. Park, and J. B. Rosen. Dimensional reduction based on centroids and least squares
for efficient processing of text data. In Proceedings for the First SIAM International Workshop
on Text Mining. Chicago, IL, 2001.
[13] T. Joachims. Text categorization with support vector machines: Learning with many relevant
features. In Proceedings of the European Conference on Machine Learning, pages 137–142,
Berlin, 1998.
[14] H. Lodhi, N. Cristianini, J. Shawe-Taylor, and C. Watkins. Text classification using string
kernels. Advances in Neural Information Processing Systems, 13:563–569, 2000.
[15] C. C. Paige and M. A. Saunders, Towards a generalized singular value decomposition, SIAM
Journal of Numerical Analysis, 18, pp. 398–405, 1981.
[16] H. Park, M. Jeon, and J. B. Rosen. Lower dimensional representation of text data based on
centroids and least squares, BIT Numerical Mathematics, 42(2):1–22, 2003.
[17] H. Park and L. Eldén. Downdating the rank-revealing URV decomposition. SIAM Journal of
Matrix Analysis and Applications, 16, pp. 138–155, 1995.
[18] C. Park and H. Park. Nonlinear feature extraction based on centroids and kernel functions.
Pattern Recognition, to appear.
[19] C. Park and H. Park. Kernel discriminant analysis based on the generalized singular value
decomposition. Technical report 03-017, Department of Computer Science and Engineering,
University of Minnesota, 2003.
[22] G. W. Stewart. An updating algorithm for subspace tracking. IEEE Transactions on Signal
Processing, 40:1535–1541, 1992.
[24] M. Stewart and P. Van Dooren. Updating a generalized URV decomposition. SIAM Journal of
Matrix Analysis and Applications, 22(2):479–500, 2000.
[27] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[28] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[29] Y. Yang and X. Liu. A re-examination of text categorization methods. In 22nd Annual Inter-
national SIGIR, pages 42–49, Berkeley, August 1999.