
Expert Systems With Applications 150 (2020) 113269


Ranking-based instance selection for pattern classification


George D.C. Cavalcanti, Rodolfo J.O. Soares
Universidade Federal de Pernambuco (UFPE), Centro de Informática (CIn), Av. Jornalista Anibal Fernandes s/n, Cidade Universitária, Recife, PE 50740-560, Brazil

Article history:
Received 16 May 2019
Revised 6 January 2020
Accepted 31 January 2020
Available online 7 February 2020

Keywords:
Instance selection
Ranking
Instance-based learning
k-nearest neighbor
Classification

Abstract: In instance-based learning algorithms, the need to store a large number of examples as the training set results in several drawbacks related to large memory requirements, oversensitivity to noise, and slow execution speed. Instance selection techniques can improve the performance of these algorithms by selecting the best instances from the original data set, removing, for example, redundant information and noisy points. The relationship between an instance and the other patterns in the training set plays an important role and can impact its misclassification by learning algorithms. Such a relationship can be represented as a value that measures how difficult the instance is regarding classification purposes. Based on that, we introduce a novel instance selection algorithm called Ranking-based Instance Selection (RIS) that attributes a score per instance that depends on its relationship with all other instances in the training set. In this sense, instances with higher scores form safe regions (neighborhoods of samples with relatively homogeneous class labels) in the feature space, and instances with lower scores form indecision regions (borderline samples of different classes). This information is further used in a selection process to remove instances, from both safe and indecision regions, that are considered irrelevant to represent their clusters in the feature space. In contrast to previous algorithms, the proposal combines a ranking procedure with a selection process, aiming to find a promising tradeoff between accuracy and reduction rate. Experiments are conducted on twenty-four real-world classification problems and show the effectiveness of the RIS algorithm when compared against other instance selection algorithms in the literature.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Instance-based learning algorithms are classifiers that have no learning phase. They use all training set instances as exemplars when generalizing. The k-Nearest Neighbor (k-NN) algorithm (Cover & Hart, 1967) is a well-known instance-based learning classifier. k-NN computes the distance from each input instance to every stored example and labels it according to the class label of its k nearest neighbors, which are its most similar patterns in the training set.

The use of the whole training set confronts k-NN with several drawbacks, such as deciding how many exemplars to store and what portion of the input space they should cover. Excessive storage can result in high memory usage and a significant increase in computational time, since multiple similarity computations between the test set and the training examples must be performed. Moreover, k-NN degrades rapidly with the level of attribute noise in the training instances (Aha et al., 1991).

Instance selection is one of the most effective approaches to increase the performance of instance-based classifiers (Garcia, Derrac, Cano, & Herrera, 2012). It is applied as a pre-processing step on the training set. This kind of technique reduces the data used for establishing a classification rule by selecting relevant prototypes. With the reduction of the stored examples, the memory requirements and the execution time for generalization decrease as well. A successful instance selection algorithm searches for the smallest subset of the original data set that maintains or even improves the classification accuracy of instance-based algorithms. As the size of data sets grows, algorithms that can present a shortlist of representative samples selected from the whole training data are of great importance.

In this work, we propose an instance selection algorithm called Ranking-based Instance Selection (RIS). RIS calculates a score per instance, and this score is used to evaluate whether the instance should be included in the final subset of prototypes. RIS is composed of two phases: ranking and selection. In the first phase, RIS starts by assigning to each instance in the training set a score that represents how vital the instance is for further generalization.

This ranking procedure aims at evaluating the relationship among the instances in the training set to identify which instances can be removed for representing irrelevant information or noisy points. The selection process is performed next, where instances with higher scores are selected first. In other words, the knowledge gathered by the ranking procedure can be applied to choose a subset of the most critical instances from the training set and, consequently, improve the performance of instance-based learning algorithms. To the best of our knowledge, this procedure that juxtaposes the ranking calculation with the selection process has not been employed previously.

RIS takes advantage of the fact that the relationship of an instance to the other patterns in the training set impacts its misclassification by learning algorithms. So, RIS calculates a score per instance that is determined by its relationship to the other instances, and this score represents a soft decision: the lower the score, the closer the instance is to the border. In other words, these scores create a heat map that shows safe regions (neighborhoods of samples with relatively homogeneous class labels) in the feature space, as well as indecision regions (borderline samples of different classes). The selection phase of RIS uses this information to choose a subset of the original training set, independently of whether these samples are borderline or belong to a homogeneous region. This is an interesting procedure because the principal contributor to increasing the likelihood of an instance being misclassified is the occurrence of outliers and border points in class-overlapping regions of the feature space (Smith, Martinez, & Giraud-Carrier, 2014). In this way, integrating into the learning process the knowledge of which instances are hard to classify (the ones that have lower scores) can lead to an increase in classification accuracy.

The major contribution of this paper is the proposal of a family of instance selection methods, called Ranking-based Instance Selection (RIS), composed of three algorithms, RIS1, RIS2, and RIS3, where each algorithm has a different compromise between accuracy and reduction rate of the training set. RIS1 has the best accuracy, while RIS3 reduces the training set the most. Contrary to previous works, RIS incorporates in the selection process a score per instance (a real value instead of a binary decision) that can be interpreted as a measure indicating how difficult it is to classify that instance. This measure proved to be an interesting strategy through a set of comprehensive experiments on 24 datasets of the KEEL repository. We show that the proposed method can considerably reduce the original size of the training set while improving the recognition rates. The results reached by our method compare favorably to other published methods. This paper extends the work in Pereira and Cavalcanti (2011b) in many ways: (i) novel algorithms of the RIS family are proposed; (ii) a deeper explanation of the proposed approaches is presented; (iii) a detailed analysis using a toy example is conducted to offer a better understanding of the proposed approach; (iv) the number of real datasets used to evaluate the proposed approach is increased; (v) a more rigorous statistical comparison is performed.

This work is organized as follows. Section 2 describes the basic concepts of the instance reduction research area, in special instance selection, and also presents how instance selection has been applied to different areas of machine learning. The proposed methods are described in Section 3, along with an illustrative example of the algorithm. Experimental studies of the proposed method on public machine learning datasets are conducted and analyzed in Section 4. Finally, in Section 5, the experimental results are summarized, and future works are suggested.

2. Reduction techniques

Instance reduction algorithms have two main purposes: to reduce or eliminate noise and outliers, and to reduce the computational burden of instance-based learning algorithms. Regardless of the main objective, the training set has its number of instances reduced.

Instance reduction techniques rely on two different approaches: instance selection (Garcia et al., 2012) and prototype generation (Triguero, Derrac, Garcia, & Herrera, 2012). Instance selection aims at finding the best subset of instances of the training dataset. In contrast, prototype generation creates new instances, called prototypes, to represent the original dataset of instances. The main advantage of selection algorithms is their capacity to choose relevant samples without generating new artificial data since, for some applications, the generation of new data is not a sound procedure. Moreover, in general terms, selection algorithms are more straightforward than generation ones. Section 2.1 recalls concepts about instance selection methods, and Section 2.2 reviews works that apply instance selection to different areas of machine learning, such as imbalance and ensemble learning.

2.1. Instance selection

Instance selection methods, also known as prototype selection (PS), can be classified according to the type of selection into Condensation, Edition, and Hybrid (Kim & Oommen, 2003). Edition algorithms aim at cleaning the noisy instances and outliers without any compromise to obtain a high reduction rate of the training dataset. On the other hand, Condensation focuses on reducing the dataset, maintaining borderline instances and removing instances that are far away from the frontiers between the classes. Hybrid algorithms aim to combine the best of the two previously mentioned classes of algorithms. To do so, they select the most representative instances regardless of whether the instances are at the border or positioned in safer regions of the feature space.

From the point of view of the direction of search, Garcia et al. (2012) classified the PS techniques into Incremental, Decremental, Batch, Mixed, and Fixed. Incremental starts with an empty subset, and an instance that satisfies a given criterion is added to the subset. In contrast, Decremental begins with the whole training set and iteratively removes instances from it. In the Batch mode, the first task is to mark the instances that should be removed; afterwards, all the marked instances are removed at once. Another strategy is to start with a preselected subset and, at each iteration, add or remove instances from this subset; this one is called Mixed. Lastly, the Fixed search is a subfamily of the Mixed search where the number of instances is predefined at the beginning of the search and never changes.

Prototype selection algorithms have been employed in many different applications, such as traffic sign recognition (Chen, Lin, Ke, & Tsai, 2015), handwritten connected digits classification (Pereira & Cavalcanti, 2011a), financial data mining (Kim, 2006), and data summarization (Smith-Miles & Islam, 2010).

2.2. Instance selection applied to different areas

In the literature, there is a myriad of methods that aim to select the most relevant prototypes from a training set. These proposals appear in different machine learning areas, such as imbalance learning, one-class classification, regression, big data, and ensemble learning.

Instead of using classical under/oversampling techniques to address the class imbalance problem, Tsai, Lin, Hu, and Yao (2019) presented an approach that combines clustering analysis and instance selection techniques. In the same vein, Kuncheva, Arnaiz-González, Díez-Pastor, and Gunn (2019) performed a theoretical evaluation of instance selection for imbalanced data using the geometric mean as the performance measure.

Bien and Tibshirani (2011) proposed a prototype selection algorithm that attempts to present a shortlist of representative samples to increase its interpretative value. In other words, their work used prototypes as a tool to make the data easier for humans to understand.

The majority of PS algorithms were developed to deal with multi-class classification problems. Krawczyk, Triguero, García, Woźniak, and Herrera (2019) proposed PS algorithms that work in scenarios in which we do not have access to counterexamples during the training phase; in other words, instead of dealing with multi-class problems, these algorithms focus on one-class classification. In contrast to existing methods that use PS for classification, Arnaiz-González, Díez-Pastor, Rodríguez, and García-Osorio (2016) proposed instance selection methods for regression tasks, where the aim is to predict continuous values instead of one class. Arnaiz-González et al. also adapted instance selection algorithms for multi-label learning (Arnaiz-González, Díez-Pastor, Rodríguez, & García-Osorio, 2018).

Standard PS methods cannot cope with huge data sets. To fill this gap, Triguero, Peralta, Bacardit, García, and Herrera (2015) proposed a partitioning methodology that uses a MapReduce-based framework to distribute the functioning of the PS algorithms through a cluster.

Ensemble learning aims at improving the precision of a classification task by combining classifiers. Cruz, Sabourin, and Cavalcanti (2017), Cruz, Sabourin, and Cavalcanti (2018b), and Cruz, Oliveira, Cavalcanti, and Sabourin (2019) showed that PS methods can improve the accuracy of dynamic ensemble selection techniques (Cruz, Sabourin, & Cavalcanti, 2018a), a promising multiple classifier approach in which the base classifiers are selected on the fly, according to each new test sample to be classified.

For more information, some reviews of instance selection can be found in the literature (Garcia et al., 2012; Kim & Oommen, 2003; Wilson & Martinez, 2000).

3. Ranking-based instance selection algorithm

The Ranking-based Instance Selection (RIS) is a hybrid instance selection technique that can be divided into two phases: Ranking and Selection. During the Ranking phase, RIS associates a score to each instance in the training set. This score represents the importance of each instance for classification purposes; the higher the score, the more important the instance. After the construction of the ranking, the Selection phase keeps the instances having the highest ranking.

For the sake of simplicity, the notation used in this paper is listed as follows:

X: the training dataset.
C: the classes of the problem.
m: the number of instances in X.
x_i: the ith instance in X.
R: the result set after the selection process.
S: the score vector that represents the ranking of the instances.
s_i: the score of x_i, s_i ∈ S.
t: the threshold parameter.
I: the indexes vector.
class(x): the class label of x.
min(V): the minimum value in vector V.
max(V): the maximum value in vector V.
d(x, y): the distance measure between the instances x and y.
argmin_v f(v): the value of v that minimizes f.

Sections 3.1 and 3.2 describe the two main phases of the Ranking-based Instance Selection (RIS), Ranking and Selection, respectively. To illustrate the RIS algorithm, we present a toy problem in Section 3.3. Sections 3.4 and 3.5 present two variations of the proposed method, RIS2 and RIS3, respectively. Finally, Section 3.6 describes the classification rule adopted by the RIS algorithms.

3.1. Ranking

Each instance has a score associated with it in such a way that an instance surrounded by patterns of its own class has a higher score than borderline and noisy instances, which are considered less important at first sight. The ranking construction works as follows: the score s_i is calculated for each instance x_i ∈ X using Eq. (1), where m is the number of instances in the training set X:

s_i = \sum_{j=1,\, j \neq i}^{m} \alpha(x_i, x_j) \times sm(x_i, x_j, X)    (1)

For the calculation of the score s_i of the instance x_i, all the patterns x_j ∈ X \setminus \{x_i\} are taken into consideration. So, for each x_j ∈ X \setminus \{x_i\}, the functions α(·) (Eq. (2)) and sm(·) (Eq. (3)) are calculated:

\alpha(x_i, x_j) = \begin{cases} 1, & \text{class}(x_i) = \text{class}(x_j) \\ -1, & \text{otherwise} \end{cases}    (2)

sm(x_i, x_j, X) = \frac{\exp(-d(x_i, x_j))}{\sum_{k=1,\, k \neq i}^{m} \exp(-d(x_i, x_k))}    (3)

The function α defines the sign of the adjustment. In other words, if x_i and x_j have the same class label, the score s_i of x_i is increased by the return value of the sm function (Eq. (3)); otherwise, the score is decreased by this value.

The function sm calculates the absolute value of the adjustment of the score of x_i given x_j. This value is given by the normalized exponential transform of the distance between x_i and x_j. This transform is also known as softmax (Bridle, 1990); it preserves the hierarchical order of the input values and is a differentiable generalization of the "winner-takes-all" rule (Haykin, 1999). This strategy ensures that smaller distances imply higher score adjustments, which is sound given that the influence of a pattern that is closer to x_i should be higher than that of patterns far away.

After the evaluation of the whole set of instances in X, the scores s_i are normalized to the [0, 1] interval using Eq. (4),

scaling(s_i, S) = \frac{s_i - \min(S)}{\max(S) - \min(S)}    (4)

where min(S) and max(S) represent the minimum and the maximum values in S, respectively.

The scores of two artificial datasets (Circles and Banana, Fig. 1(a) and (b), respectively) were calculated by Eq. (1) and are shown using two different plots: surface and contour. It is relevant to note that, to improve the visual effect, each score was inverted to 1 − score. So, in the surface plots (Fig. 1(c) and (d)), the higher the inverted score of an instance, the higher the elevation level. Fig. 1(e) and (f) show a 2D view of the surface, and it is possible to note that the regions defined by the instances having the highest scores preserve the original shapes of the datasets.

Fig. 1. Surface and contour graphs of the ranking.
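To make the ranking phase concrete, the following is a minimal sketch of Eqs. (1)-(4) in Python, assuming a NumPy feature matrix X (one row per instance), an integer label vector y, and Euclidean distance as d(·,·); the helper name compute_scores is ours, not from the paper.

    import numpy as np

    def compute_scores(X, y):
        # Ranking phase of RIS: one score per instance (Eqs. (1)-(3)).
        m = len(X)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise d(x_i, x_j)
        scores = np.zeros(m)
        for i in range(m):
            others = np.arange(m) != i
            w = np.exp(-d[i, others])
            w /= w.sum()                                    # softmax weights, Eq. (3)
            alpha = np.where(y[others] == y[i], 1.0, -1.0)  # sign of adjustment, Eq. (2)
            scores[i] = np.sum(alpha * w)                   # Eq. (1)
        # Global min-max scaling to [0, 1], Eq. (4); RIS2 later replaces this step.
        return (scores - scores.min()) / (scores.max() - scores.min())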

3.2. Selection

The instance selection phase aims at removing redundant and undesirable instances. Two definitions, radius and relevance, that are used by the proposed algorithm to select the best instances are given below.

Definition 1. The radius of an instance x, radius(x), is given by the radius of the largest hypersphere centered at x containing only instances having the same class of x.

The radius of an instance, as described in Definition 1, can be obtained by calculating the distance between the instance x_i and its nearest enemy. An enemy is defined as any instance having a different class from x_i.

Definition 2. An instance x_i ∈ X is considered relevant in the RIS algorithm if there does not exist x_r ∈ X such that: i) class(x_r) = class(x_i), and ii) d(x_i, x_r) ≤ radius(x_r), with i ≠ r.
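Under the same assumptions as the sketch in Section 3.1 (a pairwise distance matrix d and a label vector y), Definition 1 reduces to a nearest-enemy search; radii is a hypothetical helper name:

    def radii(d, y):
        # radius(x_i) = distance from x_i to its nearest enemy (Definition 1).
        m = len(y)
        r = np.zeros(m)
        for i in range(m):
            enemies = y != y[i]          # any instance of a different class
            r[i] = d[i, enemies].min()   # nearest-enemy distance
        return r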

Definition 2 can be summarized in one function (isRelevant, Eq. (5)) that is used to identify relevant instances. If x_i is relevant, the function returns true; otherwise, false. In other words, every instance that belongs to the coverage area (defined by the radius) of another instance of the same class is not relevant and should be removed:

isRelevant(x_i, R) = \begin{cases} \text{true}, & \nexists\, x_r \in R : \text{class}(x_r) = \text{class}(x_i) \wedge d(x_i, x_r) \leq radius(x_r) \\ \text{false}, & \text{otherwise} \end{cases}    (5)

The selection procedure starts by selecting the instance x_i ∈ X that has the highest score s_i ∈ S. This instance is added to the result set R. After that, the instance x_j with the second highest score is evaluated. If class(x_i) and class(x_j) are different, x_j is inserted into R; otherwise, x_j is only selected if it is not located in the space covered by the hypersphere centered at x_i with radius radius(x_i). The evaluation continues with the whole set of instances. Instances are only selected if they do not fall in the covered area of the hyperspheres generated by the patterns that have the same class as them. This strategy aims to preserve the best instances per cluster per class.

Algorithm 1 shows the pseudo-code of the proposed instance selection method (RIS1). From line 3 to line 11, the algorithm calculates the scores per instance. After that, from line 12 to 14, the scores are normalized to the [0, 1] interval. The scores S and the training set X are reordered in descending order of the instances' scores s_i (lines 15 and 16). Each instance is added to the result set R if its score is higher than a given threshold and if it is a relevant instance (Eq. (5)) (lines 18-23). Instances having scores lower than the threshold are discarded.

Algorithm 1. RIS1 pseudocode.
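A compact sketch of the selection phase as described above, reusing the hypothetical compute_scores and radii helpers; this is our reading of Algorithm 1, not a verbatim transcription of its pseudocode:

    def ris1_select(X, y, t):
        # RIS1: rank, then greedily keep high-score instances not yet covered.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        s = compute_scores(X, y)       # ranking phase (lines 1-16 of Algorithm 1)
        r = radii(d, y)                # radii defined before noise elimination
        selected = []
        for i in np.argsort(-s):       # descending order of scores
            if s[i] <= t:              # threshold test (line 18)
                continue
            # Eq. (5): drop x_i if a selected same-class instance already covers it.
            covered = any(y[j] == y[i] and d[i, j] <= r[j] for j in selected)
            if not covered:
                selected.append(i)
        return X[selected], y[selected], r[selected]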

3.3. Toy example

Fig. 2 shows the behavior of the RIS1 instance selection algorithm applied to the XOR problem. To simulate a "soft" XOR problem, a data set composed of two classes was artificially generated; Fig. 2a shows the original data set consisting of 10 instances per class. In Fig. 2b, each instance has a number associated with it that represents its order in the ranking. So, the instance labeled with the number one obtained the highest score, the instance labeled with 2 obtained the second highest score, and so on. The scores are calculated following Algorithm 1 (lines 1-16). Instances removed by the threshold (the second part of the logical expression in line 18) are represented by filled black markers. Note that these instances are placed near the classification boundaries and occupy the lowest positions of the ranking. This shows that instances with a tendency to be noisy points either are removed by the threshold or have low selection priority.

In Fig. 2c, selected instances are represented by filled markers (blue and red). First, the instance at the top of the ranking is selected. Since the hyperspheres are defined before the noise elimination, the hypersphere of instance 1 is delimited by instance 18, which is its nearest enemy. Instances 2, 5, 7 and 9 are in the coverage area of instance 1, so they are removed (lines 19-21). Instance 3 is the second selected instance, and its coverage area is delimited by instance 16. Instances 4 and 6 are removed because they are in the coverage area of instance 3. This process continues until the whole set of instances is evaluated. Fig. 2d shows the five (out of 20) instances selected by RIS. Note that not only the first elements of the ranking are selected, but also those elements considered relevant to cover the entire feature space.

Fig. 2. RIS selection process with 2-D artificial data points.

3.4. RIS2: Scaling per class

The labels of the instances are not taken into consideration in the scaling operation (Algorithm 1, lines 12-14). After scaling, the instance at the top of the ranking has the maximum score and the instance at the bottom has the minimum score. This global scaling may cause a problem if many elements of a specific class are grouped in the lower positions of the ranking. In this case, all the instances that belong to that class can be removed while noisy instances from other classes are preserved.

In order to address this problem, the scaling procedure can be performed per class. Algorithm 2 shows the scaling procedure of the RIS2 algorithm, which should replace lines 12-14 of Algorithm 1. One by one, the classes have their instances processed separately. This local scaling aims to produce a fairer ranking than the one generated by the RIS1 algorithm.

Algorithm 2. RIS2 scaling procedure.
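A sketch of the RIS2 variant of the scaling step (Eq. (4) applied within each class), which would replace the global scaling line in the ranking sketch of Section 3.1; the guard for single-valued classes is our own addition:

    def scale_per_class(scores, y):
        # RIS2: min-max scaling of the scores performed class by class.
        s = scores.astype(float).copy()
        for c in np.unique(y):
            idx = (y == c)
            lo, hi = s[idx].min(), s[idx].max()
            # Guard against degenerate classes whose scores are all equal.
            s[idx] = (s[idx] - lo) / (hi - lo) if hi > lo else 1.0
        return s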

3.5. RIS3: Avoiding noise influence in the coverage areas

In the lower left quadrant of Fig. 3, there is one small circle that is completely inside a bigger one. These two circles are defined by two instances of the square class. Only one of these two instances should be used to map that area. However, the instance that represents the inner region is selected first because it has a higher score than the instance that represents the bigger circle; so, algorithms RIS1 and RIS2 will maintain both instances. In fact, the problem is not the selection order, but the small size of the coverage area of the inner instance. Observe that the hypersphere of this instance is delimited by an enemy (a ball-class instance) that is not selected.

In the RIS3 algorithm, the definition of the coverage areas is performed after the elimination of the instances that have smaller scores than the threshold parameter. Fig. 3b shows the result set produced by RIS3 for the same data used in Fig. 3a. In Fig. 3b, the coverage area of the selected instances increased compared with Fig. 3a, since the redundant information was removed. Thus, the overlapping of the coverage areas defined by instances belonging to the same class is eliminated.

Fig. 3. Removing redundant information with RIS3.
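The RIS3 change is essentially a reordering: filter by the threshold first, then compute the radii only among the surviving instances, so that removed noisy enemies no longer shrink the coverage areas. A sketch, assuming (our reading, not stated explicitly in the paper) that RIS3 also keeps the per-class scaling of RIS2:

    def ris3_select(X, y, t):
        s = scale_per_class(compute_scores(X, y), y)
        keep = np.flatnonzero(s > t)          # noise elimination happens first
        Xk, yk, sk = X[keep], y[keep], s[keep]
        d = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
        r = radii(d, yk)                      # radii measured on the filtered set
        selected = []
        for i in np.argsort(-sk):
            covered = any(yk[j] == yk[i] and d[i, j] <= r[j] for j in selected)
            if not covered:
                selected.append(i)
        return Xk[selected], yk[selected], r[selected]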

3.6. Classification rule

In the proposed selection methods, all instances located in the space delimited by the hypersphere of a selected instance x_r are removed if they have the same class of x_r. In order to compensate for this removal, the radius of the selected instances is used in the distance calculation: the higher the radius, the smaller the distance. So, the class of a query instance x is given by the class of the instance x_r such that:

class(x) = \underset{class(x_r)}{\arg\min}\ \frac{d(x, x_r)}{radius(x_r)}    (6)

This criterion is of particular importance in the classification of new points that are located on the borders of the hyperspheres. This procedure aims to prevent instances with small hyperspheres located near the borders of instances with large hyperspheres from taking advantage in the decision.
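The rule of Eq. (6) divides the distance by the prototype's radius, so prototypes with large coverage areas win near hypersphere borders. A sketch, given the prototypes P, labels yP, and radii rP returned by the selection sketches above:

    def classify(x, P, yP, rP):
        # Class of the prototype minimizing d(x, x_r) / radius(x_r), Eq. (6).
        dist = np.linalg.norm(P - x, axis=1)
        return yP[np.argmin(dist / rP)]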

4. Experiments

Experiments were carried out using 24 databases (Table 1) from the KEEL repository (Alcalá-Fdez et al., 2011). All the experiments reported use the 10-fold cross-validation procedure, where the prior probability of each class is preserved in each fold. All the features were normalized to a number between 0 and 1.

The proposed algorithms are compared with the k-Nearest Neighbor classifier with Euclidean distance, Edited Nearest Neighbors (ENN) (Wilson, 1972), Decremental Reduction Optimization Procedure 3 (DROP3) (Wilson & Martinez, 2000), and ATISA1 (Cavalcanti, Ren, & Pereira, 2013). ENN is an edition and decremental method, while ATISA and DROP3 are hybrid and decremental (Garcia, Luengo, & Herrera, 2014). Hybrid methods maintain inner and border instances, and according to Garcia et al. (2012), this family of methods obtains an excellent tradeoff between reduction and accuracy.

The ENN rule is a widely used reduction technique that removes instances that are not correctly classified by their k nearest neighbors. ATISA1 uses the distance of each instance to its nearest enemy as a threshold and removes redundant instances that lie within the coverage area delimited by this threshold. In contrast to ENN, which only removes borderline instances, ATISA1 removes inner and borderline instances. Decremental Reduction Optimization Procedure (DROP) is a family composed of five algorithms that remove an instance if the accuracy of the k nearest neighbors rule on the set of its associates (all the instances that have the analysed instance among their k nearest neighbors) does not decrease. Among them, the DROP3 algorithm obtains the best mix of storage reduction and generalization accuracy (Wilson & Martinez, 2000).

As for performance measures, all experiments report accuracy rate and reduction percentage. The latter represents how much the original training set was reduced. In order to verify whether differences in accuracy and storage reduction are significant, a non-parametric Wilcoxon signed-rank test was performed, whose decision does not depend on the pool of classifiers included in the original experiment, as reported in Benavoli, Corani, and Mangili (2016).

The next section shows the results and compares the RIS family with other algorithms in the literature. In Section 4.2, we discuss the parameter t of the proposed algorithms.

Table 1
Dataset characteristics.

Dataset            #Instances  #Features  #Classes
adult              45,222      14         2
appendicitis       106         7          2
balance            625         4          3
bupa               345         6          2
coil2000           9822        85         2
connect-4          67,557      42         3
contraceptive      1473        9          3
haberman           306         3          2
hayes-roth         160         4          3
heart              270         13         2
ionosphere         351         33         2
led7digit          500         7          10
marketing          6876        13         9
monk-2             432         6          2
movement-libras    360         90         15
pima               768         8          2
satimage           6435        36         7
segment            2310        19         7
titanic            2201        3          2
vowel              990         13         11
wine               178         13         3
winequality-red    1599        11         6
winequality-white  4898        11         7
yeast              1484        8          10

4.1. Results

In the RIS family, an instance is included in the final prototype subset if its score is higher than the parameter t (Algorithm 1, line 18). This parameter is defined during the memorization phase using the training set, so one threshold is defined for each round of the 10-fold cross-validation. In other words, 9 folds are used to define the threshold t, and this value of t is applied on the remaining fold (test set). When using the RIS algorithm, it is expected that different values of t, in the interval [0, 1], generate different subsets of prototypes. So, for each generated subset, the performance of the RIS algorithm is calculated, and the value of t that yields the best accuracy rate on the training set is selected and used to assess the test set. So, the following procedure is performed to calculate the threshold t* per round (see the sketch after this list):

(1) for t = [0.1, 0.2, ..., 1.0]
(2)   apply RIS on the training set using t as threshold: P = RIS(T, t)
(3)   evaluate P on the whole training set T
(4) endfor
(5) t* is defined as the value of t that reached the best accuracy in line (3)
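The five-step procedure above maps directly onto a small grid search; a sketch reusing the hypothetical ris1_select and classify helpers from Section 3:

    def fit_threshold(X, y):
        # Pick t* that maximizes accuracy of the reduced set on the training set.
        best_t, best_acc = 0.1, -1.0
        for t in np.arange(0.1, 1.01, 0.1):
            P, yP, rP = ris1_select(X, y, t)
            if len(P) == 0:                    # all instances filtered out
                continue
            preds = np.array([classify(x, P, yP, rP) for x in X])
            acc = np.mean(preds == y)
            if acc > best_acc:
                best_t, best_acc = t, acc
        return best_t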

Table 2 shows the accuracy rate obtained by the proposed algorithms compared with the k-Nearest Neighbor (k-NN) classifier, where the value of k was defined per dataset as the best value among {1, 3, 5} on the training set. It also shows the reduction percentage (R), except for the k-NN, since it uses the entire training set.

Table 2
Accuracy rates and reduction percentages (R) of the proposed algorithms: RIS1, RIS2, and RIS3. The symbols +, − and ~, in the row labeled "Wilcoxon", indicate that RIS1 is significantly better, worse or equivalent compared with the other algorithms, respectively. The best results are in bold.

Dataset k-NN RIS1 R RIS2 R RIS3 R
adult 78.99 75.17 47.25 75.17 47.25 75.17 47.25
appendicitis 74.08 80.79 70.64 79.14 72.42 79.14 72.84
balance 81.90 86.72 76.53 84.17 84.76 81.48 90.91
bupa 58.40 57.97 43.83 56.85 50.18 57.14 56.14
coil2000 90.84 94.03 31.01 87.78 27.10 87.76 25.11
connect-4 56.55 65.83 52.81 65.57 46.74 65.80 95.30
contraceptive 44.95 46.91 26.14 47.05 38.36 45.69 41.37
haberman 62.81 64.98 49.20 65.65 58.02 66.34 64.78
hayes-roth 58.56 66.42 44.79 63.49 48.42 61.46 51.47
heart 76.26 74.44 50.99 77.41 61.77 78.89 77.86
ionosphere 84.42 90.62 75.50 90.63 74.64 90.92 74.89
led7digit 61.27 72.37 16.53 67.27 57.76 74.30 92.53
marketing 21.80 18.28 82.96 15.20 72.30 15.63 50.43
monk-2 95.36 92.11 72.79 92.60 83.04 92.83 87.27
movement-libras 65.60 78.11 62.28 60.56 70.73 59.22 72.58
pima 69.22 63.40 53.78 63.27 52.82 63.27 53.27
satimage 21.49 25.05 52.74 18.27 60.82 18.19 13.76
segment 94.52 92.12 88.96 91.82 90.51 91.65 91.51
titanic 63.79 69.06 17.61 56.44 34.28 55.48 68.34
vowel 93.84 87.88 76.90 66.06 83.45 65.56 84.39
wine 89.79 93.39 86.64 91.62 86.15 92.25 87.83
winequality-red 51.47 45.83 42.01 44.78 64.83 44.72 69.35
winequality-white 41.19 41.55 41.83 41.81 60.69 42.16 62.91
yeast 42.10 41.77 36.27 39.89 56.02 40.17 60.57
Mean 65.80 67.70 54.17 64.38 61.70 64.46 66.26
Wilcoxon ~ n/a n/a + − + −

Table 3
Accuracy rates (%) and reduction percentages (R) of RIS1, ENN, DROP3, and ATISA1 algorithms. The best results are in bold.

Dataset RIS1 R ENN R DROP3 R ATISA1 R
adult 75.17 47.25 81.01 20.07 80.69 91.38 80.54 84.68
appendicitis 80.79 70.64 78.33 15.52 78.33 90.98 78.33 89.62
balance 86.72 76.53 87.14 17.78 85.24 83.25 84.92 73.07
bupa 57.97 43.83 57.43 36.84 56.00 74.30 56.86 44.93
coil2000 94.03 31.01 93.96 80.39 93.96 99.96 93.94 84.17
connect-4 65.83 52.81 63.43 89.73 67.71 99.47 59.46 98.27
contraceptive 46.91 26.14 49.13 56.21 48.66 80.95 50.34 67.74
haberman 64.98 49.20 69.06 31.23 70.00 89.14 69.69 78.87
hayes-roth 66.42 44.79 42.22 76.60 44.44 87.44 36.11 81.25
heart 74.44 50.99 79.26 30.62 76.30 83.46 81.11 78.11
ionosphere 90.62 75.50 82.22 15.89 82.22 93.07 83.89 72.43
led7digit 72.37 16.53 39.82 83.13 39.45 94.82 41.27 85.84
marketing 18.28 82.96 19.88 98.92 21.32 99.79 21.14 99.67
monk-2 92.11 72.79 95.45 11.42 79.32 96.73 95.45 37.73
movement-libras 78.11 62.28 71.94 23.04 65.27 62.81 68.05 55.94
pima 63.40 53.78 71.30 26.43 71.95 82.29 71.69 61.47
satimage 25.05 52.74 21.07 99.79 21.07 99.93 21.07 99.87
segment 92.12 88.96 90.74 05.05 91.73 88.58 92.47 80.85
titanic 69.06 17.61 32.17 99.14 32.17 99.84 32.17 99.14
vowel 87.88 76.90 90.71 03.14 87.07 52.83 88.89 51.16
wine 93.39 86.64 89.47 04.31 89.47 84.39 90.53 74.66
winequality-red 45.83 42.01 52.96 48.29 52.16 81.41 53.40 61.09
winequality-white 41.55 41.83 42.02 51.99 43.34 81.82 42.08 62.00
yeast 41.77 36.27 48.69 46.96 47.32 81.67 48.63 65.00
Mean 67.70 54.17 64.39 45.28 63.56 87.58 64.00 75.10
Wilcoxon-p n/a n/a 0.731 0.391 0.345 < 0.001 0.637 0.002

Results show that, among the RIS family, RIS1 obtained the best accuracy rate, while RIS3 was the best regarding the reduction rate. It is important to remark that, even being the method with the lowest reduction rate in the RIS family, RIS1 was able to discard more than 54 percent of the training data, on average, without hindering the accuracy. Compared to the k-NN, RIS1 obtained a better accuracy, by almost two percentile points, on average. The last row of the table shows the results of the non-parametric Wilcoxon signed-rank test for zero median at a 0.05 significance level; the symbols +, − and ~ in that row indicate that RIS1 is significantly better, worse or equivalent compared with the other algorithms, respectively.

Based on the trade-off between the accuracy rates and the reduction percentages shown in Table 2, RIS1 was chosen to represent the RIS family in the comparison study with the state-of-the-art algorithms (Table 3). The proposed approach is compared with the following algorithms: Edited Nearest Neighbors (Wilson, 1972), Decremental Reduction Optimization Procedure 3 (DROP3) (Wilson & Martinez, 2000), and ATISA1 (Cavalcanti et al., 2013). Excluding ATISA1, all compared instance selection methods are implemented in the KEEL software (Garcia et al., 2012).
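The significance testing reported in Tables 2 and 3 is the standard Wilcoxon signed-rank test over paired per-dataset results; a minimal SciPy sketch, using the first three accuracy rows of Table 3 (RIS1 vs. DROP3) purely for illustration:

    from scipy.stats import wilcoxon

    ris1  = [75.17, 80.79, 86.72]    # adult, appendicitis, balance (Table 3)
    drop3 = [80.69, 78.33, 85.24]
    stat, p = wilcoxon(ris1, drop3)  # paired, non-parametric, zero-median null
    print(stat, p)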

Fig. 4. Accuracy x Reduction chart. Each point represents the average accuracy and reduction rates for each instance selection algorithm.

Table 4
Accuracy rate (%) for the threshold obtained using the training (ttrain) and test (ttest) sets. Δ shows the difference between them in percentile points.

Dataset RIS1: ttrain ttest Δ | RIS2: ttrain ttest Δ | RIS3: ttrain ttest Δ
adult 75.17 75.43 00.26 75.17 77.09 01.92 75.17 77.13 01.96
appendicitis 80.79 84.70 03.91 79.14 92.26 13.12 79.14 92.26 13.12
balance 86.72 89.93 03.21 84.17 87.70 03.53 81.48 85.95 04.47
bupa 57.97 61.43 03.46 56.85 62.31 05.46 57.14 62.34 05.20
coil2000 94.03 94.03 00.00 87.78 87.78 00.00 87.76 87.77 00.01
connect-4 65.83 65.83 00.00 65.57 65.59 00.02 65.80 65.80 00.00
contraceptive 46.91 49.15 02.24 47.05 48.95 01.90 45.69 49.15 03.46
haberman 64.98 74.16 09.18 65.65 70.23 04.58 66.34 69.62 03.28
hayes-roth 66.42 71.94 05.52 63.49 66.40 02.91 61.46 64.44 02.98
heart 74.44 82.59 08.15 77.41 84.07 06.66 78.89 85.19 06.30
ionosphere 90.62 92.89 02.27 90.63 90.92 00.29 90.92 91.22 00.30
led7digit 72.37 73.34 00.97 67.27 68.84 01.57 74.30 76.44 02.14
marketing 18.28 18.34 00.06 15.20 15.53 00.33 15.63 16.06 00.43
monk-2 92.11 92.11 00.00 92.60 92.84 00.24 92.83 94.00 01.17
movement-libras 78.11 80.11 02.00 60.56 60.56 00.00 59.22 59.22 00.00
pima 63.40 69.00 05.60 63.27 68.08 04.81 63.27 67.57 04.30
satimage 25.05 25.13 00.08 18.27 18.31 00.04 18.19 18.20 00.01
segment 92.12 92.60 00.48 91.82 91.82 00.00 91.65 91.65 00.00
titanic 69.06 69.06 00.00 56.44 56.75 00.31 55.48 56.21 00.73
vowel 87.88 87.88 00.00 66.06 66.06 00.00 65.56 65.56 00.00
wine 93.39 93.95 00.56 91.62 93.32 01.70 92.25 97.25 05.00
winequality-red 45.83 51.46 05.63 44.78 48.67 03.89 44.72 50.61 05.89
winequality-white 41.55 45.06 03.51 41.81 43.39 01.58 42.16 42.92 00.76
yeast 41.77 42.37 00.60 39.89 46.85 06.96 40.17 45.75 05.58
Mean 67.70 70.10 02.40 64.38 66.85 02.47 64.46 67.18 02.72

The experimental results reported in Table 3 show that RIS1 reached the highest average accuracy rates in 10 out of 24 data sets, while ENN, DROP3, and ATISA obtained the best accuracy rates in 5 out of 24 datasets each.

ENN is an edition algorithm that aims at removing points that contradict their neighbors, so, as expected, its strategy was the one with the lowest level of reduction. In contrast, DROP3 is a hybrid algorithm that obtained high reduction rates at the cost of harming accuracy. ATISA, which also belongs to the hybrid family of algorithms, obtained similar precision when compared with DROP3. On average, RIS1 outperformed DROP3 and ATISA by almost four percentile points regarding accuracy. The last row of the table shows the p-value of the Wilcoxon signed-rank test, and we can observe that the algorithms present a similar precision. As highlighted by the boldface values in the table, some algorithms perform much better than the others in some datasets. This suggests that a procedure to choose the algorithm depending on the task could achieve impressive rates.

Fig. 4 shows a two-dimensional plot: accuracy versus reduction. Each point in this figure represents the average values shown in Tables 2 and 3. The top right corner is the best possible solution. RIS1 is ahead in terms of accuracy, and RIS3 reached an interesting compromise between generalization accuracy and reduction rate.

4.2. Analysis of the parameter t

The RIS algorithms have the parameter t, a threshold used to select the relevant instances. In the previous section, this threshold was defined using the training set. In this section, we conduct an evaluation supposing that the best threshold can be found.
It is a kind of Oracle in which the best threshold is defined using the test dataset. This analysis aims at verifying whether there is room for improvement in the threshold definition procedure, i.e., is the optimal threshold set during training? Moreover, if the threshold obtained during training is not the best, how much could we improve the accuracy rate given that we know the optimal threshold?

Table 4 shows the accuracy rates for the threshold calculated using the training (ttrain) and test (ttest) datasets for the RIS algorithms. The column Δ shows the difference between the accuracy rates achieved using ttest and ttrain. RIS1 obtained the smallest difference (on average) between the hit rates of the training and test sets, meaning that RIS1 was able to better approximate the predicted threshold to the ideal one. The difference (Δ) was zero in 5 datasets for RIS1, and this means that, for these datasets, no improvement was achievable since the best threshold was found during training. The same can be observed for RIS2 and RIS3, in four and three datasets, respectively. It is also possible to observe that, for some datasets, the difference Δ is high. For instance, RIS2 and RIS3 could benefit from an increment of more than 13 percentile points in the appendicitis dataset if the best t were chosen.

We also performed a statistical test to verify the significance of the results given that we have the best threshold. The non-parametric Wilcoxon signed-rank test for zero median at a 0.05 significance level showed that all the RIS algorithms are significantly better than the literature algorithms discussed in the last section if we are able to determine the threshold better.

5. Conclusion

In this work, we introduced a new instance selection algorithm, called Ranking-based Instance Selection (RIS), that aims at selecting a subset of the original training set. In this sense, it is similar to hybrid methods previously published. But, in contrast, RIS uses a ranking strategy to select the best instances in terms of classification. This ranking is based on a score assigned to each instance: the higher the number of nearby patterns of the same class, the higher the instance score. Therefore, the ranking implies that borderline or noisy instances have low priority. In addition to the ranking, each instance is responsible for an area of coverage in the feature space. This area is defined by a hypersphere that has the instance as its center and is delimited by its nearest enemy. The selection process uses the scores to establish an order of the instances in the training set, and instances that belong to the coverage area of another instance of the same class are removed.

The scores calculated in the proposed approach can be used as a heat map that shows border and safe regions of the feature space. In our work, we used 2-dimensional toy problems to illustrate this property of the scores; however, it could also be investigated for high-dimensional tasks. One limitation of the proposed approach resides in how the instance coverage area is defined. It is assumed to be a hypersphere centered at the instance, with radius equal to the distance to its nearest enemy. So, a different strategy (probably not a hypersphere) could benefit from avoiding the use of information from only one direction (the nearest enemy) while the other directions are ignored in the definition of the instance coverage area.

The three versions of the proposed technique were evaluated on twenty-four datasets. The experimental study provided empirical evidence of the effectiveness of the proposed techniques for outperforming literature algorithms. In general terms, RIS1 obtains the highest average generalization accuracy, RIS3 obtains the best reduction rates, and RIS2 is a balanced choice between accuracy and reduction.

For future work, we intend to investigate how to select the most appropriate version of the RIS algorithms per dataset using data complexity measures (de Oliveira Moura, de Freitas, Cardoso, & Cavalcanti, 2014), as well as how to better define the parameter t of the proposed approach. Furthermore, the effectiveness of the novel method at an instance level, making use of instance hardness (Smith et al., 2014) and other instance properties, remains to be investigated.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank the Brazilian agencies CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior), CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) and FACEPE (Fundação de Amparo à Ciência e Tecnologia de Pernambuco).

References

Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37–66.
Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., & Herrera, F. (2011). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2–3), 255–287.
Arnaiz-González, Á., Díez-Pastor, J.-F., Rodríguez, J. J., & García-Osorio, C. (2018). Study of data transformation techniques for adapting single-label prototype selection algorithms to multi-label learning. Expert Systems with Applications, 109, 114–130.
Arnaiz-González, Á., Díez-Pastor, J. F., Rodríguez, J. J., & García-Osorio, C. I. (2016). Instance selection for regression by discretization. Expert Systems with Applications, 54, 340–350.
Benavoli, A., Corani, G., & Mangili, F. (2016). Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research, 17(5), 1–10.
Bien, J., & Tibshirani, R. (2011). Prototype selection for interpretable classification. The Annals of Applied Statistics, 5(4), 2403–2424.
Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman-Soulié, & J. Hérault (Eds.), Neuro-computing: Algorithms, architectures and applications (pp. 227–236). New York: Springer-Verlag.
Cavalcanti, G. D. C., Ren, T. I., & Pereira, C. L. (2013). ATISA: Adaptive threshold-based instance selection algorithm. Expert Systems with Applications, 40(17), 6894–6900.
Chen, Z.-Y., Lin, W.-C., Ke, S.-W., & Tsai, C.-F. (2015). Evolutionary feature and instance selection for traffic sign recognition. Computers in Industry, 74, 201–211.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
Cruz, R. M., Oliveira, D. V., Cavalcanti, G. D., & Sabourin, R. (2019). FIRE-DES++: Enhanced online pruning of base classifiers for dynamic ensemble selection. Pattern Recognition, 85, 149–160.
Cruz, R. M., Sabourin, R., & Cavalcanti, G. D. (2018a). Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41, 195–216.
Cruz, R. M. O., Sabourin, R., & Cavalcanti, G. D. C. (2017). Analyzing different prototype selection techniques for dynamic classifier and ensemble selection. In International Joint Conference on Neural Networks (IJCNN) (pp. 3959–3966).
Cruz, R. M. O., Sabourin, R., & Cavalcanti, G. D. C. (2018b). Prototype selection for dynamic classifier and ensemble selection. Neural Computing and Applications, 29(2), 447–457.
Garcia, S., Derrac, J., Cano, J., & Herrera, F. (2012). Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 417–435.
Garcia, S., Luengo, J., & Herrera, F. (2014). Data preprocessing in data mining. Springer Publishing Company.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.
Kim, K.-j. (2006). Artificial neural networks with evolutionary instance selection for financial forecasting. Expert Systems with Applications, 30(3), 519–526.
Kim, S.-W., & Oommen, B. J. (2003). A brief taxonomy and ranking of creative prototype reduction schemes. Pattern Analysis & Applications, 6(3), 232–244.
Krawczyk, B., Triguero, I., García, S., Woźniak, M., & Herrera, F. (2019). Instance reduction for one-class classification. Knowledge and Information Systems, 59(3), 601–628.
Kuncheva, L. I., Arnaiz-González, Á., Díez-Pastor, J.-F., & Gunn, I. A. D. (2019). Instance selection improves geometric mean accuracy: A study on imbalanced data classification. Progress in Artificial Intelligence, 8(2), 215–228.
de Oliveira Moura, S., de Freitas, M. B., Cardoso, H. A., & Cavalcanti, G. D. (2014). Choosing instance selection method using meta-learning. In IEEE International Conference on Systems, Man, and Cybernetics (pp. 2003–2007).
Pereira, C. S., & Cavalcanti, G. D. C. (2011a). Handwritten connected digits detection: An approach using instance selection. In IEEE International Conference on Image Processing (pp. 2613–2616).
Pereira, C. S., & Cavalcanti, G. D. C. (2011b). Instance selection algorithm based on a ranking procedure. In International Joint Conference on Neural Networks (pp. 2409–2416).
Smith, M. R., Martinez, T., & Giraud-Carrier, C. (2014). An instance level analysis of data complexity. Machine Learning, 95(2), 225–256.
Smith-Miles, K., & Islam, R. (2010). Meta-learning for data summarization based on instance selection method. In IEEE Congress on Evolutionary Computation (pp. 1–8).
Triguero, I., Derrac, J., Garcia, S., & Herrera, F. (2012). A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(1), 86–100.
Triguero, I., Peralta, D., Bacardit, J., García, S., & Herrera, F. (2015). MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing, 150, 331–345.
Tsai, C.-F., Lin, W.-C., Hu, Y.-H., & Yao, G.-T. (2019). Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences, 477, 47–54.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2(3), 408–421.
Wilson, D. R., & Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3), 257–286.
