
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 5, MAY 2009

A Novel Template Reduction Approach for the K-Nearest Neighbor Method

Hatem A. Fayed and Amir F. Atiya

Manuscript received August 24, 2007; revised December 21, 2008; accepted March 11, 2009. First published April 21, 2009; current version published May 01, 2009.
H. A. Fayed is with the Department of Engineering Mathematics and Physics, Cairo University, Cairo 12613, Egypt (e-mail: [email protected]).
A. F. Atiya is with the Department of Computer Engineering, Cairo University, Cairo 12613, Egypt (e-mail: [email protected]).
Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2009.2018547

Abstract: The K-nearest neighbor (KNN) rule is one of the most widely used pattern classification algorithms. For large data sets, the computational demands for classifying patterns using KNN can be prohibitive. A way to alleviate this problem is through the condensing approach, which removes patterns that are more of a computational burden but do not contribute to better classification accuracy. In this brief, we propose a new condensing algorithm. The proposed idea is based on defining the so-called chain, a sequence of nearest neighbors from alternating classes. We make the point that patterns further down the chain are close to the classification boundary, and based on that we set a cutoff for the patterns we keep in the training set. Experiments show that the proposed approach effectively reduces the number of prototypes while maintaining the same level of classification accuracy as the traditional KNN. Moreover, it is a simple and fast condensing algorithm.

Index Terms: Condensing, cross validation, editing, K-nearest neighbor (KNN), template reduction.

I. INTRODUCTION

The K-nearest neighbor (KNN) classification rule is one of the most well-known and widely used nonparametric pattern classification methods. Its simplicity and effectiveness have led it to be widely used in a large number of classification problems, including handwritten digits, satellite image scenes, and medical diagnosis [1]–[5]. For KNN, however, two major outstanding problems are yet to be resolved by the research community. The first issue is the selection of the best K (the number of neighbors to consider), a choice that is greatly affected by the finite-sample nature of the problem. The second issue is the computational and storage cost: the traditional KNN rule requires storing the whole training set, which may be an excessive amount of storage for large data sets and leads to a large computation time in the classification stage. There are two well-known families of procedures for reducing the number of prototypes (sometimes referred to as template reduction techniques). The first approach, called editing, processes the training set with the aim of increasing generalization capability. This is accomplished by removing prototypes that contribute to the misclassification rate, for example, outlier patterns or patterns that are surrounded mostly by others of different classes [6]–[8]. The second approach is called condensing. Its aim is to obtain a small template that is a subset of the training set without changing the nearest neighbor decision boundary substantially. The idea is that the patterns near the decision boundary are crucial to the KNN decision, whereas those far from the boundary do not affect the decision. Therefore, a systematic removal of these ineffective patterns helps to reduce the computation time. This can be established by reducing the number of prototypes that are centered in dense areas of the same class [9]–[20].


In this brief, we consider only the condensing approach. Below is a short summary of some existing condensing algorithms.

In 1968, Hart [9] was the first to propose an algorithm for reducing the size of the stored data for the nearest neighbor decision (the algorithm is called CNN). Hart defined a consistent subset of the data as one that classifies the remaining data correctly with the nearest neighbor rule. He built this consistent set by sequentially adding to it data points from the training set as long as the added data point is misclassified (using the 1-NN rule). By construction, the resulting reduced subset classifies all the training data correctly. Empirical results have shown that Hart's CNN rule considerably reduces the size of the training set at the expense of minimal or even no degradation in classification performance. The drawback of CNN is that it frequently keeps some points that are far from the decision boundary. To combat this, in 1972, Gates [10] proposed what he called the reduced nearest neighbor rule (RNN). This method is based on first applying CNN and then performing a postprocessing step, in which the data points in the consistent set are revisited and removed if their deletion does not result in misclassifying any point in the training set. Experimental results confirmed that RNN yields a slightly smaller training subset than that obtained with CNN. In [11], Bhattacharya et al. proposed two methods, one based on the Voronoi graph and the other based on the Gabriel graph. The methods have the merit that they are exact and yield sets independent of the order in which the data are processed. The method based on the Voronoi graph yields a condensed set which is both training-set consistent (i.e., it classifies all the training data correctly) and decision-boundary consistent (i.e., it determines exactly the same decision boundary as that of the entire training set). However, it suffers from a large complexity due to the need to construct the Voronoi diagram. On the other hand, the method based on the Gabriel graph is faster, but it is neither decision-boundary consistent nor training-set consistent. In [12], Wilson and Martinez (2000) presented five algorithms for reducing the size of case bases: DROP1, DROP2, ..., DROP5. Decremental reduction optimization procedure 1 (DROP1) is the basic removal scheme based on so-called associate patterns. The associate patterns of some pattern p are the patterns that have p as one of their K nearest neighbors. The removal of p is determined based on its effect on the classification of its associates. DROP2 is a modification whereby the order of the patterns to be removed is selected according to a certain distance criterion, so as to remove patterns furthest from the decision boundary first. DROP2 also differs from DROP1 in that deletion decisions still rely on the original set of associates. DROP3, DROP4, and DROP5 are versions in which a noise-filtering pass is performed prior to applying the DROP2 procedure. In [13], Mollineda et al. (2002) obtained a condensed 1-NN classifier by merging the nearest clusters of the same class as long as the set of new representatives correctly classifies all the original patterns. In [14], Wu et al. (2002) proposed an efficient method to reduce the training set required for KNN while maintaining the same level of classification accuracy, namely, the improved KNN (IKNN). This is implemented by iteratively eliminating patterns that exhibit high attractive capacities (the attractive capacity s_y of a pattern y is defined as the number of patterns from class C(y) that are closer to pattern y than other patterns belonging to other classes). The algorithm filters out a large portion of prototypes that are unlikely to match against the unknown pattern. This accelerates the classification procedure considerably, especially in cases where the dimensionality of the feature space is high. Other approaches for condensing are based on: 1) evolutionary algorithms and decision trees [15], 2) space partitioning [16], 3) decision boundary preservation [17], 4) estimation of the distribution of representatives according to the information they convey [18], 5) a gradient-descent technique for learning prototype positions and local metric weights [19], and 6) incorporation of both the proximity between patterns and the geometrical distribution around the given pattern [20]. There are other methods that combine both editing and condensing techniques, forming a hybrid model [21].

In this brief, we introduce a new condensing algorithm, namely, template reduction for KNN (TRKNN). The basic idea is to define a chain of nearest neighbors. By setting a cutoff value for the distances along the chain, we effectively separate the patterns into the selected condensed set (probably consisting of patterns near the decision boundary) and the removed set (probably interior patterns). The paper is organized as follows. The proposed TRKNN method is described in Section II. Some analytical insights are introduced in Section III. Then the proposed method is validated experimentally in Section IV. Results are discussed in Section V. Finally, conclusions are drawn in Section VI.

II. TEMPLATE REDUCTION FOR KNN

The goal of the proposed approach is to discard prototypes that are far from the boundaries and have little influence on the KNN classification. To establish this, we first introduce the so-called nearest neighbor chain. This is simply a sequence of nearest neighbors from alternating classes. Consider first the following definitions. Consider a pattern x_i (also call it x_{i0}, and let it be from class ω_m). Let x_{i1} = NN(x_{i0}) denote the nearest neighbor to x_{i0} that is from a different class. Similarly, let x_{i2} = NN(x_{i1}) denote the nearest neighbor to x_{i1} that is from the starting class ω_m. We continue in this manner, with x_{i,j+1} = NN(x_{ij}). This sequence of x_{ij}'s (whose class memberships alternate between class ω_m and the other classes) constitutes the nearest neighbor chain. Below is the precise definition.

Definition: A nearest neighbor chain C_i of a pattern x_i (of class ω_m) is defined as the sequence x_{i0}, x_{i1}, x_{i2}, ..., x_{ik} together with the sequence d_{i0}, d_{i1}, ..., d_{i,k-1}, where the root pattern x_{i0} = x_i, and x_{ij} is the closest pattern to x_{i,j-1} (of a class different from ω_m if j is odd, and of class ω_m if j is even). Moreover, d_{ij} = ||x_{i,j+1} − x_{ij}||_2 is the Euclidean distance between patterns x_{i,j+1} and x_{ij}. The chain is stopped at x_{ik} if x_{i,k+1} = x_{i,k-1}. Note that the distance sequence is a nonincreasing sequence (i.e., d_{ij} ≥ d_{i,j+1}).

Fig. 1 shows some examples of constructed chains for a two-class problem. In summary, a chain is constructed as follows. Start from a pattern x_i. Find the nearest neighbor from a different class. Then, from that pattern, find the nearest neighbor from the starting class. Continue in this manner until we end up with two patterns that are nearest neighbors of each other. Note that by construction the distances between the patterns in the chain form a nonincreasing sequence. Note also that patterns downstream in the chain will probably be close to the classification boundary, because they will have smaller distances from the patterns of different classes. This provides the basis of the proposed condensing procedure.

The basic idea of the proposed condensing approach is as follows. For each pattern x_i in the training set, we construct its corresponding chain C_i. The pattern x_{ij} in the chain is dropped (from the selected condensed set) if d_{ij} > α · d_{i,j+1}, where α is a threshold greater than 1 and j = 0, 2, 4, ... up to the size of the chain. Note that we allow only patterns from the same class as that of x_i to be eliminated (i.e., we consider only the even-indexed patterns in the chain). This is important when dealing with a multiclass problem, as the chain is constructed using the one-against-all concept illustrated earlier. Typically, when starting the chain with an interior point, the distance to the next point in the chain will be large. As the chain converges onto the boundary points, the distances decrease in value and more or less level off. This gives a rationale for the proposed cutoff procedure: if there is a significant decrease in the distances, the considered pattern is deemed to be probably an interior point and can be discarded, whereas if the distances do not decrease too much, then we are probably oscillating around the classification boundary, and the pattern is kept.
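To make the procedure concrete, the following is a minimal Python/NumPy sketch of the chain construction and cutoff rule just described. It is an illustrative reimplementation, not the authors' code (the brief's experiments were run in MATLAB): the function names, the brute-force nearest-neighbor search, and the tie handling are our own choices, and at least two patterns per class are assumed. The formal listing of the algorithm is given below.

import numpy as np

# Illustrative sketch (not the authors' code): nearest-neighbor chains and the
# TRKNN cutoff rule, using a brute-force distance search for clarity.
# Assumes a numeric pattern matrix X (n x D), integer labels y, and >= 2 classes.

def _nearest(x, X, candidates):
    # Index (taken from `candidates`) of the pattern closest to x.
    d = np.linalg.norm(X[candidates] - x, axis=1)
    return int(candidates[np.argmin(d)])

def build_chain(i, X, y):
    # Chain rooted at pattern i: alternately take the nearest neighbor from the
    # other classes (one-against-all) and from the root's own class; stop when
    # the last two patterns are mutual nearest neighbors (x_{i,k+1} = x_{i,k-1}).
    same = np.flatnonzero(y == y[i])
    other = np.flatnonzero(y != y[i])
    chain, dists = [int(i)], []
    while True:
        cur = chain[-1]
        pool = other if len(chain) % 2 == 1 else same   # alternate the class pools
        nxt = _nearest(X[cur], X, pool)
        if nxt in chain:        # stop condition (membership test also guards against ties)
            break
        dists.append(float(np.linalg.norm(X[nxt] - X[cur])))   # d_{ij}
        chain.append(nxt)
    return chain, dists

def trknn_condense(X, y, alpha=1.2):
    # Mark x_{ij} (even j, i.e., same class as the root) whenever
    # d_{ij} > alpha * d_{i,j+1}; keep everything that is never marked.
    marked = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        chain, d = build_chain(i, X, y)
        for j in range(0, len(d) - 1, 2):
            if d[j] > alpha * d[j + 1]:
                marked[chain[j]] = True
    return X[~marked], y[~marked]

In practice, the pairwise distance matrix would be computed once and reused inside the search (as noted in the complexity discussion of Section V), rather than recomputing norms for every chain step as done here for brevity.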


Fig. 1. Illustrative example of the chains.

Below are the precise steps of the condensing method.

Algorithm: TRKNN

Inputs:
    Training set T.
    Distance ratio threshold α.
Output:
    Reduced training set T'.
Method:
    For each pattern x_i in T
        Find its corresponding chain C_i
    End For
    For each chain C_i
        For j = 0, 2, 4, ... up to the size of C_i
            If d_{ij} > α · d_{i,j+1} then mark the pattern x_{ij}
        End For
    End For
    Drop all marked patterns from T.

Fig. 1 shows an example that illustrates the working of the algorithm. The closest pattern of a different class to pattern x_1 is x_{11}, and the distance between them is d_{11}. Similarly, the closest pattern of a different class to pattern x_{11} is x_{12}, and the distance between them is d_{12}. Now x_1 is dropped if d_{11} > α · d_{12}.

Note that some computational savings can be achieved by observing that some chains encompass other, smaller chains. By operating on the larger chains first, we automatically get the condensing results of the smaller chains contained in them, leading to a computational shortcut. Note also that the condensed set does not depend on the pattern presentation order, unlike many condensing algorithms; the reason is that at each step the full training set is considered as a whole. Another comment, which applies also to most condensing methods, is that after condensing it is imperative to make the number of neighbors K (for the KNN classification) a little smaller than the K used with the whole data set. This is due to the somewhat redundant patterns being removed. This reduction will be bigger if more data is removed.

TABLE I
SUMMARY OF DATA SETS

III. ANALYTICAL INSIGHTS

The proposed algorithm relies on the concept that a pattern near the classification boundary will tend to have a nearest neighbor (from another class) that is fairly close. On the other hand, for an interior pattern that distance will tend to be larger. Based on this concept, we discard the latter patterns. While this concept is fairly intuitive, it is beneficial to provide some analytical treatment in order to gain insight and to understand the degree of, and the factors affecting, that relationship. Below we develop some approximate analysis.

For simplicity, consider a two-class case. Consider a pattern x_0 from class ω_1, and let the dimension of the pattern vector be D. Moreover, let p(x|ω_2) denote the class-conditional density for class ω_2, and let there be N training patterns from class ω_2. Finally, let NN(x_0) denote the nearest neighbor to pattern x_0 that is from class ω_2, and let r be the distance from x_0 to NN(x_0). Define the following random variable:

δ = ∫_{||x − x_0|| ≤ r} p(x|ω_2) dx.


TABLE II
SUMMARY OF AVERAGE TRAINING TIME AND TEST TIME IN SECONDS

TABLE III
MEAN NUMBER OF PROTOTYPES (NPROT) AND MEAN TEST ERROR RATES (ERR) OVER THE DIFFERENT FOLDS.
STANDARD DEVIATIONS ARE SHOWN IN BRACKETS

It is well known (see [22] and [23]) that such a variable obeys the following beta density function:

p(δ) = N (1 − δ)^{N−1},  0 ≤ δ ≤ 1.

Assume that the number of training patterns from class ω_2 is sufficiently large such that NN(x_0) is close to x_0. Hence, within the ball centered around x_0 and stretching out to NN(x_0), the density p(x|ω_2) can be approximated as almost constant, and hence

δ ≈ c · r^D · p(x_0|ω_2)

where the term c · r^D represents the volume of a D-dimensional ball of radius r, with c = π^{D/2} / Γ(1 + D/2). Then, we get

r ≈ (δ / (c · p(x_0|ω_2)))^{1/D}.

The expectation of r is then given by

E(r) = ∫_0^1 (δ / (c · p(x_0|ω_2)))^{1/D} N (1 − δ)^{N−1} dδ

which can be evaluated as

E(r) = [Γ(N+1) Γ(1 + 1/D) (Γ(1 + D/2))^{1/D}] / [√π · Γ(N + 1 + 1/D)] · p(x_0|ω_2)^{−1/D}.
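The intermediate step behind this evaluation is the standard beta integral; the rearrangement below is ours, under the same constant-density approximation used above:

\begin{aligned}
E(r) &\approx \frac{N}{\bigl(c\,p(x_0\mid\omega_2)\bigr)^{1/D}}\int_0^1 \delta^{1/D}(1-\delta)^{N-1}\,d\delta
      = \frac{N\,B\!\left(1+\tfrac{1}{D},\,N\right)}{\bigl(c\,p(x_0\mid\omega_2)\bigr)^{1/D}} \\
     &= \frac{\Gamma(N+1)\,\Gamma\!\left(1+\tfrac{1}{D}\right)}{\Gamma\!\left(N+1+\tfrac{1}{D}\right)}\,
        \bigl(c\,p(x_0\mid\omega_2)\bigr)^{-1/D},
\qquad c=\frac{\pi^{D/2}}{\Gamma\!\left(1+\tfrac{D}{2}\right)},
\end{aligned}

and substituting c yields the closed form above.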
The previous relation confirms the fact that the distance to the nearest neighbor from the other class is small if we are close to the classification boundary (where the opposing class-conditional density p(x|ω_2) would be high). Conversely, that distance would be large if p(x|ω_2) were small (signifying that we are in the interior and far away from the boundary). Moreover, that monotone relationship decays more slowly for large-dimensional problems. One might contemplate the situation when we are near the boundary but p(x|ω_2) is still small. This situation arises when the other class-conditional density p(x|ω_1) is also small, that is, we operate in a relatively sparse area of the space. To compensate for that, the algorithm uses the other distances in the chain to obtain a relative cutoff point. That is, we discard patterns based on comparing them according to the successive distances in the chain (i.e., when d_{ij} > α · d_{i,j+1}).

IV. EXPERIMENTAL RESULTS

To validate the proposed algorithm, we compared it with the traditional KNN and with two condensing methods, DROP2 [12] and IKNN [14], on several real-world data sets. Note that, as we focus on comparing condensing methods, we do not employ DROP3, DROP4, or DROP5 [12], as these just add preprocessing steps that can be applied to any method. On the other hand, DROP1's accuracy is significantly low compared to KNN and DROP2. Therefore, to attain a fair comparison, DROP2 is included in the comparison. We use the fivefold validation procedure for the purpose of tuning the key parameters of each method. In this approach, the training set is partitioned into five equal parts. Training is performed on four parts and validated on the fifth part. Then the validation part is rotated and training is performed again. The process is repeated five times, and the validation classification error on all five parts is combined. The parameter values that yield the best validation results are then chosen for testing the performance. The tuned parameters are as follows. In all methods, the suggested values for K are the odd values from 1 to 9. For IKNN, the suggested values for the attractive capacity threshold (S) are [0.01, 0.05, 0.1] × N_min, where N_min is the minimum number of patterns belonging to a single class, while the portion function is taken proportional to (t + 1)^{−0.5}, with the proportionality constant in {0.1, 0.2} (for more details and a description of the parameters, see [14]). For TRKNN, the suggested values for α are 1.2, 1.4, and 1.6. The distance metric used is the Euclidean distance for all methods. Concerning the DROP2 method, there are no tunable parameters.
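For concreteness, a sketch of this tuning loop is given below. It is illustrative only: `classify_error` is an assumed helper (not defined in the brief) that condenses the training part with TRKNN at a given α and returns the error of KNN with the given K on the validation part.

import numpy as np
from itertools import product

def tune_trknn(X, y, classify_error, ks=(1, 3, 5, 7, 9), alphas=(1.2, 1.4, 1.6),
               n_parts=5, seed=0):
    # Fivefold validation over (K, alpha): train on four parts, validate on the
    # fifth, rotate the validation part, and keep the pair with the lowest
    # combined validation error.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_parts)
    best = None
    for K, alpha in product(ks, alphas):
        errs = []
        for v in range(n_parts):
            va = folds[v]
            tr = np.concatenate([folds[p] for p in range(n_parts) if p != v])
            errs.append(classify_error(X[tr], y[tr], X[va], y[va], K, alpha))
        score = float(np.mean(errs))
        if best is None or score < best[0]:
            best = (score, K, alpha)
    return best   # (validation error, K, alpha)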
There are four main performance measures. The training time represents the time it takes to condense the training set, including searching for the optimal parameters, such as K and the others. (However, it is computed as the average time per tested parameter set. This way we avoid penalizing methods that have a larger number of parameters, such as the competing IKNN, or a finer parameter grid.)


TABLE IV
COMBINED 5 × 2 CROSS-VALIDATION F TEST FOR THE TEST CLASSIFICATION ERROR (ERR) AND THE NUMBER OF PROTOTYPES (NPROT). REJECTION DECISION OF THE NULL HYPOTHESIS IS SHOWN IN BRACKETS

Fig. 2. Test classification error versus the number of prototypes over the 5 × 2 folds for the Pima Indians data set.

Fig. 3. Test classification error versus the number of prototypes over the 5 × 2 folds for the Breast Cancer data set.

The testing time represents the classification time using the condensed set. It is very closely tied to the third measure, which is the number of patterns in the condensed set. The test classification error is the final performance measure. The main goal of any condensing method is to reduce the number of patterns as much as possible, with as little sacrifice as possible in classification accuracy.

In comparing the classification error and the number of prototypes of any two methods, a challenging issue is to test whether the difference is statistically significant. We have used the combined 5 × 2 cross-validation F test [24] (a study in [25] shows the superiority of this test compared to alternative ones). To apply this test, five replications of twofold cross validation are performed. In each replication, the data set is divided into two equal-sized sets, one for training and the other for testing. Let p_i^{(j)} be the difference between the error rates of the two classifiers on fold j = 1, 2 of replication i = 1, ..., 5. Let the average on replication i be p̄_i = (p_i^{(1)} + p_i^{(2)}) / 2 and let the estimated variance be s_i^2 = (p_i^{(1)} − p̄_i)^2 + (p_i^{(2)} − p̄_i)^2. The combined 5 × 2 cross-validation F test is applied by assuming that the statistic

f = [ Σ_{i=1}^{5} Σ_{j=1}^{2} (p_i^{(j)})^2 ] / [ 2 Σ_{i=1}^{5} s_i^2 ]

has approximately an F distribution with ten and five degrees of freedom. The null hypothesis is rejected with 95% confidence if f is greater than 4.74.
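In code, the statistic is a direct transcription of the formula above; the sketch below is ours, and the 5 × 2 array `p` of per-fold error-rate differences is assumed to have been collected as just described.

import numpy as np

def combined_5x2cv_f(p):
    # p[i, j]: difference in error rates on fold j (j = 0, 1) of replication i (i = 0..4).
    p = np.asarray(p, dtype=float)            # shape (5, 2)
    p_bar = p.mean(axis=1, keepdims=True)     # per-replication average
    s2 = ((p - p_bar) ** 2).sum(axis=1)       # per-replication variance estimate
    return (p ** 2).sum() / (2.0 * s2.sum())  # ~ F(10, 5) under the null hypothesis

# Reject the null hypothesis at the 95% level when the returned statistic exceeds 4.74
# (the critical value quoted in the text).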


Fig. 4. Test classification error versus the number of prototypes over the 5 × 2 folds for the Balance Scale data set.

Fig. 5. Test classification error versus the number of prototypes over the 5 × 2 folds for the Landsat data set.

We used five real-world data sets. The first data set was obtained from the cancer1.dt file of the Proben1 database [26], which was created based on the Breast Cancer Wisconsin problem data set from the University of California at Irvine (UCI) Machine Learning Repository database [27]. The second data set was obtained from the diabetes1.dt file of the Proben1 database [26], which was created based on the Pima Indians Diabetes problem data set from the UCI Machine Learning Repository database [27]. The remaining data sets were obtained from the UCI repository [27]. A summary of the data sets is shown in Table I. It shows the names and the details of the data sets, such as the number of patterns, the number of features, and the number of classes. We performed the implementation using MATLAB 7 on Windows XP with SP2 running on a Pentium IV 2.4-GHz PC with 768-MB RAM.

V. RESULTS

The results for the average training time over all folds and the test time (in seconds) are shown in Table II, while the number of prototypes and the test classification error rates are shown in Table III. The results of the 5 × 2 significance test applied to the test classification error and to the number of prototypes are shown in Table IV for the different data sets. The table shows that for all data sets the F test at the 95% confidence level does not reject the hypothesis that IKNN and TRKNN give a similar test error, and also does not reject the hypothesis that KNN and TRKNN give a similar test error. On the other hand, TRKNN gives a smaller number of prototypes (i.e., a higher reduction rate) than IKNN for two of the data sets (Pima Indians and Balance Scale), also at the 95% level. For the three other data sets, TRKNN's outperformance is not significant at the 95% level. Compared to DROP2, we observe the following. For the Pima Indians data set, DROP2 gets a significantly lower number of prototypes, but this is at the expense of a significantly worse test error (as if DROP2 dropped too many patterns, to an extent that it affected classification performance). On the other hand, for the Pendigits data set, TRKNN produces a significantly lower number of prototypes, while the test error is comparable.

Fig. 6. Test classification error versus the number of prototypes over the 5 × 2 folds for the Pendigits data set.

We note that there is generally a tradeoff between the test error and the number of prototypes (NPROT) selected (or the reduction ratio). To clarify that the proposed TRKNN is winning in the overall NPROT/test error tradeoff, we performed the following analysis. It is based on the performed 5 × 2 test. As mentioned, we perform the test ten times (on the two folds times the five partitions). Consider a single test (i.e., on one specific fold and one specific partition) and record NPROT and the test error. Repeat for the ten tests and get ten pairs of NPROT/test error numbers. Plot these pairs as points in the 2-D space with the axes being NPROT and the test error. We will have ten points for TRKNN, corresponding to the NPROT/test error outcomes of the ten test sets, ten other points for IKNN, corresponding also to the test outcomes for IKNN, and ten other points for DROP2. Figs. 2-6 show the plots for each of the tested UCI data sets. One can see in the plots that for all problems the test errors for both TRKNN and IKNN are comparable. On the other hand, for three of the problems (Breast Cancer, Pima Indians, and Balance Scale) the number of prototypes for TRKNN is significantly lower than that of IKNN. For one problem (Pendigits), TRKNN wins in the NPROT aspect (i.e., gives lower NPROT), but by a small amount. For the remaining problem (Landsat), both methods are about equal. When we say one method "significantly beats" another, it is based on the fact that the averages are different and the standard deviations do not lead to overlapping of the points, which can be seen visually in the plot. For some of the problems (such as Balance Scale), IKNN gives a lower average test error, but there is a large overlap of the points (in the test error dimension), and that makes the difference not statistically significant. Concerning DROP2 versus TRKNN, as observed before, one can see that DROP2 obtains a lower NPROT at the expense of a worse test error for the Pima Indians data set. On the other hand, for both Landsat and Pendigits, DROP2 produces a significantly worse (i.e., higher) NPROT. This happens while the test errors for DROP2 and TRKNN are comparable. One can see the clear separation in the NPROT dimension, while in the test error dimension the data overlap. It seems that possibly for larger data sets DROP2 does not prune out enough points (which paradoxically are the type of problems where we need to drop points the most).
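The scatter analysis described above is straightforward to reproduce; a minimal matplotlib sketch follows (the `results` dictionary of per-fold (NPROT, error) pairs is a placeholder for the recorded outcomes, not data from the brief).

import matplotlib.pyplot as plt

def nprot_error_scatter(results):
    # results: {method name: (list of 10 NPROT values, list of 10 test errors)}.
    markers = {"TRKNN": "o", "IKNN": "s", "DROP2": "^"}
    for name, (nprot, err) in results.items():
        plt.scatter(nprot, err, marker=markers.get(name, "x"), label=name)
    plt.xlabel("Number of prototypes (NPROT)")
    plt.ylabel("Test classification error")
    plt.legend()
    plt.show()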


Concerning the training time (Table II), the time of TRKNN is considerably shorter than that of IKNN, by a factor of around 2 or 3. This is because the training time of TRKNN is dominated by the computation of the distance matrix (the matrix that holds the distances between all pairs of training patterns), whose complexity is O(n^2), where n is the training set size. This distance matrix is computed only once at the beginning of the training process. For IKNN, besides the computation of the distance matrix, there is an extra computation at each iteration for the evaluation and sorting of the attractive capacities, with complexity O(nA log A), where A is the average attractive capacity. For large data sets, this attractive capacity A could be rather large. This computation is repeated for a number of iterations (say J), leading to an extra complexity of O(nJA log A) beyond that of TRKNN. Similarly, TRKNN is faster (in training speed) than DROP2, by a factor of around 3 or 4. The reason for the slow training speed of DROP2 is the need to sort the distances computed from each pattern to its enemy pattern, where the enemy pattern is defined as the closest pattern from a different class.

Overall, viewing all performance criteria, such as the test error/number of prototypes tradeoff and the speed, we feel that TRKNN has an edge over the competing IKNN and DROP2 methods.

VI. CONCLUSION

In this brief, a new condensing method for KNN is proposed. The method drops patterns that are far away from the boundary and thus have little influence on the KNN classification. Experiments show that the proposed approach reduces the template set size without sacrificing accuracy compared to the traditional KNN and two recent condensing methods. In addition, the method is simple to implement and computationally fast.

REFERENCES

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. New York: Academic, 1990.
[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[3] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ser. Statistics. Berlin, Germany: Springer-Verlag, 2001.
[4] A. R. Webb, Statistical Pattern Recognition, 2nd ed. London, U.K.: Wiley, 2002.
[5] W. Duch, "Similarity based methods: A general framework for classification, approximation and association," Control Cybern., vol. 29, no. 4, pp. 937–968, 2000.
[6] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, no. 3, pp. 408–420, Jul. 1972.
[7] P. A. Devijver and J. Kittler, "On the edited nearest neighbor rule," in Proc. 5th Int. Conf. Pattern Recognit., Miami, FL, 1980, pp. 72–80.
[8] F. J. Ferri and E. Vidal, "Colour image segmentation and labeling through multiedit-condensing," Pattern Recognit. Lett., vol. 13, pp. 561–568, 1992.
[9] P. E. Hart, "The condensed nearest neighbor rule," IEEE Trans. Inf. Theory, vol. IT-14, no. 3, pp. 515–516, May 1968.
[10] W. Gates, "The reduced nearest neighbor rule," IEEE Trans. Inf. Theory, vol. IT-18, no. 3, pp. 431–433, May 1972.
[11] B. K. Bhattacharya, R. S. Poulsen, and G. T. Toussaint, "Application of proximity graphs to editing nearest neighbor decision rules," in Proc. 16th Symp. Interface Between Comput. Sci. Statist., 1984, pp. 97–108.
[12] D. R. Wilson and T. R. Martinez, "Reduction techniques for instance-based learning algorithms," Mach. Learn., vol. 38, no. 3, pp. 257–286, 2000.
[13] R. A. Mollineda, F. J. Ferri, and E. Vidal, "An efficient prototype merging strategy for the condensed 1-NN rule through class-conditional hierarchical clustering," Pattern Recognit., vol. 35, pp. 2771–2782, 2002.
[14] Y. Wu, K. Ianakiev, and V. Govindaraju, "Improved k-nearest neighbor classification," Pattern Recognit., vol. 35, pp. 2311–2318, 2002.
[15] J. R. Cano, F. Herrera, and M. Lozano, "Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study," IEEE Trans. Evol. Comput., vol. 7, no. 6, pp. 561–575, Dec. 2003.
[16] J. S. Sánchez, "High training set size reduction by space partitioning and prototype abstraction," Pattern Recognit., vol. 37, no. 7, pp. 1561–1564, 2004.
[17] R. Barandela, F. J. Ferri, and J. S. Sánchez, "Decision boundary preserving prototype selection for nearest neighbor classification," Int. J. Pattern Recognit. Artif. Intell., vol. 19, no. 6, pp. 787–806, 2005.
[18] D. Huang and T. W. S. Chow, "Enhancing density-based data reduction using entropy," Neural Comput., vol. 18, no. 2, pp. 470–495, 2006.
[19] R. Paredes and E. Vidal, "Learning prototypes and distances: A prototype reduction technique based on nearest neighbor error minimization," Pattern Recognit., vol. 39, no. 2, pp. 171–179, 2006.
[20] J. S. Sánchez and A. I. Marqués, "An LVQ-based adaptive algorithm for learning from very small codebooks," Neurocomputing, vol. 69, no. 7–9, pp. 922–927, 2006.
[21] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining Knowl. Disc., vol. 6, pp. 153–172, 2002.
[22] D. A. Fraser, Nonparametric Methods in Statistics. New York: Wiley, 1957, ch. 4.
[23] R. D. Short and K. Fukunaga, "The optimal distance measure for nearest neighbor classification," IEEE Trans. Inf. Theory, vol. IT-27, no. 5, pp. 622–627, Sep. 1981.
[24] E. Alpaydin, "Combined 5 × 2 cv F test for comparing supervised classification learning algorithms," Neural Comput., vol. 11, pp. 1885–1892, 1999.
[25] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Comput., vol. 10, pp. 1895–1923, 1998.
[26] L. Prechelt, "Proben1: A set of neural-network benchmark problems," University of Karlsruhe, Germany, 1994. [Online]. Available: http://page.mi.fu-berlin.de/prechelt/Biblio/1994-21.pdf
[27] C. L. Blake and C. J. Merz, "UCI Repository of Machine Learning Databases," Dept. Inf. Comput. Sci., Univ. California, Irvine, Irvine, CA, 1998. [Online]. Available: http://www.ics.uci.edu/~mlearn
