

Imbalance: Oversampling algorithms for imbalanced classification in R


Ignacio Cordón, Salvador García∗, Alberto Fernández, Francisco Herrera
DaSCI Andalusian Institute of Data Science and Computational Intelligence, University of Granada, Spain

Article info

Article history: Received 1 March 2018; Revised 14 June 2018; Accepted 25 July 2018; Available online xxx.

Keywords: Oversampling; Imbalanced classification; Machine learning; Preprocessing; SMOTE.

∗ Corresponding author. E-mail address: [email protected] (S. García).

https://doi.org/10.1016/j.knosys.2018.07.035

Abstract

Addressing imbalanced datasets in classification tasks is a relevant topic in research studies. The main reason is that for standard classification algorithms, the success rate when identifying minority class instances may be adversely affected. Among different solutions to cope with this problem, data level techniques have shown a robust behavior. In this paper, the novel imbalance package is introduced. Written in R and C++, and available at the CRAN repository, this library includes recent relevant oversampling algorithms to improve the quality of data in imbalanced datasets, prior to performing a learning task. The main features of the package, as well as some illustrative examples of its use, are detailed throughout this manuscript.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

The imbalance classification problem is probably one of the most researched problems in the machine learning framework [1–5]. Its classical definition is a binary classification problem where we are given a set of training instances, labeled with two possible classes, and a set of unlabeled instances, namely the test set, to classify using the information provided by the former. When the size of one class, which usually represents the most important concept to predict, is much lower than that of the other one, we have an imbalanced classification problem.

In these cases, standard classification learning algorithms may be biased toward the majority class examples. In order to address this issue, a common approach is to apply a preprocessing stage for rebalancing the training data. This can be carried out either by undersampling, i.e., removing majority class instances, or oversampling, i.e., introducing new minority class instances. Since undersampling may remove some relevant instances, oversampling is usually preferred. Additionally, this procedure is intended to reinforce the concept represented by the minority class, so that the learning algorithm will be guided to avoid misclassifying these examples.

Plenty of excellent oversampling algorithms arise every day in the scientific literature, but the software is rarely released. To the best of our knowledge, there are just a few open source libraries and packages that include methods and techniques related to imbalanced classification.

First, regarding Java, we may find a complete module for imbalanced classification in the KEEL software suite [6]. It comprises a very complete collection of external and internal approaches, as well as a large number of ensemble methods that work at both levels.

For Python, there exists a very recent toolbox named imbalanced-learn [7]. Similar to KEEL, it includes solutions based on preprocessing and ensemble learning.

Finally, for R we may find several packages at CRAN which include oversampling and undersampling methods. Specifically, we must refer to unbalanced [8], smotefamily [9], and rose [10,11].

However, there are two main issues associated with the aforementioned software solutions. On the one hand, only imbalanced-learn allows a straightforward representation of the preprocessed datasets. This fact is very important in order to acknowledge the actual areas that are reinforced for the minority class examples. On the other hand, and possibly the most significant point, we have observed that none of them contain the latest approaches proposed in the specialized literature.

In this paper, we present a novel, robust and up-to-date R package including preprocessing techniques for imbalanced classification. Named imbalance, it aims to provide both the state-of-the-art oversampling methods and some of the most recent techniques which still lack an implementation in the R language. In this sense, we intend to provide a significant contribution to the already available tools that address the same problem.

To present this novel package, the rest of the manuscript is arranged as follows. First, Section 2 presents the software framework and enumerates the implemented algorithms. Then, Section 3 shows some illustrative examples. Finally, Section 4 presents the conclusions.


Fig. 1. PDFOS applied to newthyroid1 dataset.

Table 1
Comparison of the proposed imbalance package to the available R packages for imbalanced classification.

Property           imbalance    unbalanced   smotefamily   rose
Version            1.0.0        2.0          1.2           0.0-3
Date               2018-02-18   2015-06-26   2018-01-30    2014-07-15
#Techniques        12           9            6             1
Undersampling      ✗            √            ✗             ✗
Oversampling       √            √            √             √
SMOTE (& var.)     √            √            √             ✗
Advanced OverS.    √            ✗            ✗             ✗
Filtering          √            √            ✗             ✗
Wrapper            √            √            ✗             ✗
Visualization      √            ✗            ✗             ✗


Fig. 2. PDFOS + NEATER applied to newthyroid1.

Table 2
Code metadata.

Nr.   Code metadata description                                            Value
C1    Current code version                                                 1.0.0
C2    Permanent link to code/repository used of this code version          github.com/ncordon/imbalance
C3    Legal Code License                                                   GPL (≥ 2)
C4    Code versioning system used                                          git
C5    Software code languages, tools, and services used                    R (≥ 3.3.0), C++
C6    Compilation requirements, operating environments and dependencies    Rcpp, bnlearn, KernelKnn, ggplot2, mvtnorm
C7    If available, link to developer documentation/manual                 ncordon.github.io/imbalance
C8    Support email for questions                                          [email protected]

2. Software

The significance of oversampling techniques in imbalanced classification is beyond all doubt. The main reason is related to the smart generation of new artificial minority samples in those areas that need reinforcement for the learning of class-fair classifiers.


Fig. 3. Original banana dataset.

Fig. 4. Imbalanced banana dataset.

Fig. 5. SMOTE applied to imbalanced banana.

Fig. 6. MWMOTE applied to imbalanced banana.

Fig. 7. RWO applied to imbalanced banana.

Fig. 8. PDFOS applied to imbalanced banana.

Since the original SMOTE algorithm in 2002 [12], many different approaches have been designed to improve the classification performance under different scenarios [13,14]. In this newly developed software package, we have compiled some of the newest oversampling algorithms, which are listed below:

• mwmote. The Majority Weighted Minority Oversampling Technique (MWMOTE), first proposed in [15], is an extension of the original SMOTE algorithm [12,13]. It assigns higher weight to borderline instances, undersized minority clusters and examples near the borderline of the two classes.
• racog, wracog. Rapidly Converging Gibbs (RACOG) and wrapper-based RACOG (wRACOG), both proposed by [16], work for discrete attributes. They generate new examples with respect to an approximated distribution using a Gibbs Sampler scheme. RACOG needs the number of instances to generate beforehand. wRACOG requires a target classifier to show no improvement in order to stop generating examples.
• rwo. Random Walk Oversampling (RWO) is an algorithm introduced by [17], which generates synthetic instances so that the mean and deviation of numerical attributes remain close to the original ones.
• pdfos. Probability Distribution density Function estimation based Oversampling (PDFOS) was proposed in [18]. It uses multivariate Gaussian kernel methods to locally approximate the minority class.

Apart from those oversampling methods, we provide a filtering method called neater. The filteriNg of ovErsampled dAta using non cooperaTive gamE theoRy (NEATER), introduced in [19], is highly based on game theory. It discards the instances with a higher probability of belonging to the opposite class, based on each instance's neighborhood.
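Each of the generators above is exposed as an R function of the same name. The following is a minimal sketch, assuming the other generators share the numInstances interface that the pdfos example of Section 3 uses:

# Minimal sketch: two of the generators listed above applied to the
# newthyroid1 dataset bundled with the package. The numInstances
# argument mirrors the pdfos() call shown in Section 3; it is assumed
# here that the remaining generators accept the same argument name.
library("imbalance")
data(newthyroid1)

extraMwmote <- mwmote(newthyroid1, numInstances = 40)
extraRwo    <- rwo(newthyroid1, numInstances = 40)

# Each call returns a data frame of synthetic minority examples that can
# be appended to the original data with rbind(), as in Section 3.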


The package also includes the method oversample, which is a wrapper that eases calls to the described and already existing methods.

To evaluate the oversampling process, we propose a visual method, called plotComparison. It plots a pairwise comparative grid of a selected set of attributes, both in the original dataset and the oversampled one. That way, if a proper oversampling has been performed, we expect to see larger minority clusters in the resulting dataset.

In addition to this, imbalance includes some datasets from the KEEL [6,20] repository (http://www.keel.es/datasets.php), which can be used to perform experiments. Additional datasets can be easily imported under a single constraint: they must contain a class column (not necessarily the last one) having two different values.

To conclude this section, we show in Table 1 a comparison of the main features of this novel imbalance package with respect to the previous software solutions available at CRAN: unbalanced [8], smotefamily [9], and rose [10,11]. Among the properties that are contrasted, we include the latest date of release, the number of preprocessing techniques that are available, and whether each package includes different approaches, namely undersampling, oversampling, SMOTE (and variants/extensions), advanced oversampling (techniques beyond SMOTE), and filtering methods. We also show whether an automatic wrapper procedure is available, and whether it includes a visualization of the preprocessed output.

From Table 1 we may conclude that imbalance is a complete solution with many relevant oversampling techniques. It is by far the one with the largest number of approaches, and the only one that includes oversampling approaches beyond the traditional SMOTE scheme. Finally, we must stress the relevance of the visualization feature, which can be very useful for practitioners to check the areas where the minority class is mainly reinforced (Table 2).

3. Examples of use

The following example loads the dataset newthyroid1, included in our package, and applies the algorithm PDFOS to the dataset, requesting 80 new instances. newthyroid1 is a classical dataset that has a series of 5 different medical measurements as attributes, and classifies every patient as hyperthyroidism (42 instances) or non-hyperthyroidism (173 instances). Once the algorithm has been applied, we plot a pairwise visual comparison between the first three attributes of the original and modified datasets. The result can be observed in Fig. 1, and in Fig. 2 after applying filtering.


# Load the previously installed imbalance package
library("imbalance")

# Load the dataset newthyroid1, included in imbalance
data(newthyroid1)

# Compute the imbalance ratio of newthyroid1 (that is, the proportion
# of minority examples with respect to majority ones)
imbalanceRatio(newthyroid1)
# 0.1944444

# Generate 80 new minority instances for newthyroid1 using the
# pdfos algorithm
newSamples <- pdfos(newthyroid1, numInstances = 80)

# Add the new samples to the original newthyroid1 dataset and
# assign the result to a newDataset variable
newDataset <- rbind(newthyroid1, newSamples)

# Compare the three first variables of the extended dataset
# and the original one
plotComparison(newthyroid1, newDataset, attrs = names(newthyroid1)[1:3])

# Filter the synthetic examples in newSamples, so that only
# relevant ones remain in the dataset
filteredSamples <- neater(newthyroid1, newSamples, iterations = 500)

# Add the new filtered samples to the original dataset
filteredNewDataset <- rbind(newthyroid1, filteredSamples)

# Compare the three first variables of the extended filtered
# dataset and the original one
plotComparison(newthyroid1, filteredNewDataset, attrs = names(newthyroid1)[1:3])
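The same pipeline can presumably be condensed with the oversample wrapper introduced in Section 2. The paper does not show its argument names, so the ratio, method and filtering parameters below, and the assumption that the wrapper returns the whole rebalanced dataset, are illustrative guesses only:

# Hypothetical condensed version of the example above using the
# oversample() wrapper from Section 2. The argument names ratio,
# method and filtering are assumptions, not taken from the paper,
# and oversample() is assumed to return the full rebalanced dataset.
balancedThyroid <- oversample(
  newthyroid1,
  ratio = 0.8,        # assumed: target imbalance ratio after oversampling
  method = "PDFOS",   # assumed: identifier of the oversampling technique
  filtering = TRUE    # assumed: apply NEATER to the generated instances
)

# Visual check of the rebalanced data, as before
plotComparison(
  newthyroid1,
  balancedThyroid,
  attrs = names(newthyroid1)[1:3]
)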


The banana dataset, which has been included in the package, is a binary dataset with 5300 samples and two attributes (apart from the class attribute), artificially developed to represent a banana shape when plotted in two dimensions, with each class filled with a different color. A straightforward random undersampling has yielded an imbalanced dataset (with a 10% imbalance ratio) from the original dataset, both observable in Figs. 3 and 4, respectively. The original dataset has been included as banana-orig in the package; the imbalanced one, as banana. We provide a visual comparison between the results of applying oversampling to reach a 50% imbalance ratio in banana, followed by a later filtering using NEATER. The oversampling techniques applied are SMOTE (Fig. 5), MWMOTE (Fig. 6), RWO (Fig. 7) and PDFOS (Fig. 8).
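A sketch of how these comparisons could be reproduced is shown below. The number of synthetic instances needed to move banana from a 10% to a 50% imbalance ratio is derived from the class counts; the column name Class and the attribute positions are assumptions about how the bundled dataset is laid out:

# Sketch: reproducing one of the comparisons of Figs. 5-8. The class
# column name ("Class") and the attribute positions are assumptions
# about the bundled banana dataset.
library("imbalance")
data(banana)

counts <- table(banana$Class)
nMin <- min(counts)
nMaj <- max(counts)

# imbalance ratio = minority / majority; to reach a 0.5 ratio we need
# 0.5 * nMaj minority examples in total, hence this many new ones:
numNeeded <- ceiling(0.5 * nMaj - nMin)

newSamples <- mwmote(banana, numInstances = numNeeded)  # or rwo(), pdfos()
filtered   <- neater(banana, newSamples, iterations = 100)

plotComparison(
  banana,
  rbind(banana, filtered),
  attrs = names(banana)[1:2]   # assumed: the two numeric attributes
)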
Algorithm 1 MWMOTE oversampling.
Require: S+ = {x1, . . . , xm}, minority instances
Require: S− = {y1, . . . , ym′}, majority instances
Require: T, requested number of synthetic examples
Require: k1, KNN parameter to filter noisy instances of S+
Require: k2, KNN parameter to compute the boundary U ⊆ S−
Require: k3, KNN parameter to compute the boundary V ⊆ S+
Require: α, tolerance for the closeness level to the borderline
Require: C, weight of the closeness factor to the borderline
Require: Cclust
1: Initialize S′ = ∅
2: For each x ∈ S+, compute its k1 KNN neighbourhood, NN^{k1}(x)
3: Let S+_f = S+ − {x ∈ S+ : NN^{k1}(x) ∩ S+ = ∅}
4: Compute U = ∪_{x ∈ S+_f} NN−^{k2}(x)
5: Compute V = ∪_{x ∈ U} NN+^{k3}(x)
6: For each x ∈ V, compute P(x) = Σ_{y ∈ U} I_{α,C}(x, y)
7: Normalize P(x) for each x ∈ V: P(x) = P(x) / Σ_{z ∈ V} P(z)
8: Compute T_clust = C_clust · (1 / |S+_f|) Σ_{x ∈ S+_f} min_{y ∈ S+_f, y ≠ x} d(x, y)
9: Let L1, . . . , LM ⊆ S+ be the clusters for S+, with T_clust as threshold
10: for t = 1, . . . , T do
11:   Pick x ∈ V with respect to P(x)
12:   Uniformly pick y inside Le, where Le is the cluster with x ∈ Le
13:   Uniformly pick r ∈ [0, 1]
14:   S′ = S′ ∪ {x + r(y − x)}
15: end for
16: return S′, synthetic examples

Algorithm 2 RACOG oversampling.
Require: S = {x1, . . . , xm}, positive examples
Require: β, burn-in
Require: α, lag
Require: T, requested number of synthetic examples
1: P = approximation of the S distribution
2: S′ = ∅
3: M = ⌈T/m⌉ · α + β
4: for t = 1, . . . , M do
5:   S′′ = GibbsSampler(S, P)
6:   if t > β and t mod α = 0 then
7:     S′ = S′ ∪ S′′
8:   end if
9: end for
10: S′ = Pick T random instances from S′
11: return S′, synthetic examples

Algorithm 3 wRACOG oversampling.
Require: Strain = {zi = (xi, yi)}_{i=1}^{m}, train instances
Require: Sval, validation instances
Require: wrapper, a binary classifier
Require: T, size of the sliding window τ of sensitivities
Require: α, tolerance parameter
1: S = Strain
2: P = approximation of the S distribution
3: Build a model with wrapper and Strain
4: Initialize S′ = ∅
5: Initialize τ = (+∞, . . . , +∞)
6: while the standard deviation of τ ≥ α do
7:   S′′ = GibbsSampler(S, P)
8:   Let Smisc ⊆ S′′ be the examples misclassified by the model
9:   Update S′ = S′ ∪ Smisc
10:  Update the train set Strain = Strain ∪ Smisc
11:  Build a new model with wrapper and Strain
12:  Let s = sensitivity of the model over Sval
13:  Let τ = (τ2, . . . , τT, s)
14: end while
15: return S′, synthetic examples

4. Conclusions

Class imbalance in datasets is one of the most decisive factors to take into account when performing classification tasks. To improve the behavior of classification algorithms in imbalanced domains, one of the most common and effective approaches is to apply oversampling as a preprocessing step. It works by creating synthetic minority instances to increase the number of representatives belonging to that class.

In this paper we have presented the imbalance package for R. It was intended to alleviate some drawbacks that arise in current software solutions. Firstly, to provide useful implementations of those novel oversampling methods that were not yet available for researchers and practitioners. Secondly, to include a visualization environment for the sake of observing those areas of the minority class that are actually reinforced by means of preprocessing. Finally, to enable a simpler integration of these methods with the existing oversampling packages at CRAN.

As future work, we propose to keep maintaining and adding functionality to our new imbalance package. Specifically, we plan to include those new oversampling techniques that are regularly proposed in the specialized literature. In this sense, we consider that there are good prospects to improve the software in the near future.

Acknowledgments

This work is supported by the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.


Algorithm 4 RWO oversampling.
Require: S = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m}, positive instances
Require: T, required number of instances
1: Initialize S′ = ∅
2: for each j = 1, . . . , d do
3:   if the j-th attribute is numerical then
4:     σ̂_j = √( (1/m) Σ_{i=1}^{m} (w_j^(i))² − ( (1/m) Σ_{i=1}^{m} w_j^(i) )² )
5:   end if
6: end for
7: Assign M = ⌈T/m⌉
8: for t = 1, . . . , M do
9:   for i = 1, . . . , m do
10:    for j = 1, . . . , d do
11:      if the j-th attribute is numerical then
12:        Choose r ∼ N(0, 1)
13:        w_j = w_j^(i) − (σ̂_j / √m) · r
14:      else
15:        Choose w_j uniformly over {w_j^(1), . . . , w_j^(m)}
16:      end if
17:    end for
18:    S′ = S′ ∪ {(w1, . . . , wd)}
19:  end for
20: end for
21: S′ = Choose T random instances from S′
22: return S′, synthetic positive instances

Algorithm 5 PDFOS oversampling.
Require: S = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m}, positive instances
Require: T, required number of instances
1: Initialize S′ = ∅
2: Search for the h which minimizes M(h)
3: Find U, the unbiased covariance matrix of S
4: Compute U = R · R^T with the Choleski decomposition
5: for i = 1, . . . , T do
6:   Choose x ∈ S
7:   Pick r with respect to a normal distribution, i.e. r ∼ N_d(0, 1)
8:   S′ = S′ ∪ {x + hrR}
9: end for
10: return S′, synthetic positive instances

Algorithm 6 NEATER filtering.
Require: S = {z1 = (x1, y1), . . . , zn = (xn, yn)}, original dataset
Require: S′ = {z̄1 = (x̄1, ȳ1), . . . , z̄m = (x̄m, ȳm)}, synthetic positive instances
Require: k, number of KNN neighbours
Require: T, required number of iterations
Require: α, smooth factor
1: Initialize E = ∅
2: For each x̄i ∈ S′, compute its neighbourhood NN^k(x̄i) ⊆ S ∪ S′
3: For i = 1, . . . , n initialize δi = (1, 0) if yi = 1 and δi = (0, 1) if yi = −1
4: For i = n + 1, . . . , n + m initialize δi = (0.5, 0.5)
5: for t = 1, . . . , T do
6:   for i = 1, . . . , m do
7:     Compute the total payoff: ui = Σ_{xj ∈ NN^k(x̄i)} g(d(x̄i, xj)) · δ_{n+i} · δj^T
8:     Compute the positive payoff: u = Σ_{xj ∈ NN^k(x̄i)} g(d(x̄i, xj)) · (1, 0) · δj^T
9:     Assign w = (α + u) / (α + ui)
10:    Update δ_{n+i} = (w · δ_{n+i,1}, 1 − w · δ_{n+i,1})
11:  end for
12: end for
13: for i = 1, . . . , m do
14:   if δ_{n+i,1} > 0.5 then
15:     E = E ∪ {(x̄i, 1)}
16:   end if
17: end for
18: return E ⊆ S′, filtered synthetic positive instances

Appendix A. Installation

To install our package, the R language is needed (see https://www.r-project.org/ for further indications on how to install it). Once R is properly installed, it suffices to run, in an R interpreter:

install.packages("imbalance")

This command will install the latest version of the package directly from CRAN, which is the official repository for R packages.

Appendix B. Description of the algorithms in the package

Hereafter, the new preprocessing techniques included in the imbalance package will be further described. To do so, Section B.1 first provides a short introduction to the classification task. Then, Sections B.2–B.6 include the main characteristics of the algorithms together with an explanation of the upsides and downsides of every oversampling approach.

B1. Classification task

The classification problem is one of the best known problems in the machine learning framework. We are given a set of training instances, namely S = {(x1, y1), . . . , (xm, ym)} where xi ∈ X ⊂ R^n and yi ∈ Y, with Y = {0, 1}. X and Y will be called the domain and label set, respectively. The training set will be considered as independent and identically distributed (i.i.d.) samples taken with respect to an unknown probability P over the set X × Y, denoting that as S ∼ P^m.


A test set will be a set of i.i.d. instances T = {(x̄1, ȳ1), . . . , (x̄k, ȳk)}, (x̄i, ȳi) ∈ X × Y, taken with respect to P. We denote Tx := {x̄1, . . . , x̄k}. A classifier hS (depending on the training set S) is a function that takes an arbitrary test set, lacking the labels ȳi, and outputs labels for each instance. That is, hS(Tx) = {(x̄1, ŷ1), . . . , (x̄k, ŷk)}. The aim of the classification problem is to find the classifier that minimizes the labeling error LP with respect to the true distribution of the data, defined as follows:

LP(hS) := E_{(x,y)∼P} 1[hS({x}) ≠ {(x, y)}]

where 1[Condition] returns 1 if Condition holds, and 0 otherwise. Hence, LP represents an average of the error over the domain instances, weighted by the probability of extracting those instances.

Since we do not know the true distribution of the data, we will usually approximate LP(hS) using the average error over a test set T = {(x̄1, ȳ1), . . . , (x̄k, ȳk)}:

L(hS) = (1/k) Σ_{i=1}^{k} 1[hS({x̄i}) ≠ {(x̄i, ȳi)}]

Specifically, the imbalance package provides oversampling algorithms. This family of procedures aims to generate a set E of synthetic positive instances based on the training ones, so that we have a new classification problem with S̄+ = S+ ∪ E, S̄− = S− and S̄ = S̄+ ∪ S̄− as our new training set.

B2. MWMOTE

This algorithm, proposed by [15], is one of the many modifications of SMOTE [12], which is a classic algorithm to treat class imbalance. SMOTE generates new examples by filling empty areas among the positive instances. It updates the training set iteratively, by performing:

E := E ∪ {x + r · (y − x)},  x, y ∈ S+,  r ∼ N(0, 1)

But SMOTE has a clear downside: it does not detect noisy instances. Therefore, it can generate synthetic examples out of noisy ones, or even between two minority clusters, which, if not cleansed up, may end up becoming noise inside a majority class cluster. MWMOTE (Majority Weighted Minority Oversampling Technique) tries to overcome both problems. It intends to give higher weight to borderline instances, undersized minority clusters and examples near the borderline of the two classes.

Let us introduce some notation and definitions. Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the unlabeled positive instances.

• d(x, y) stands for the Euclidean distance between x and y.
• NN^k(x) ⊆ S will be the k-neighbourhood of x in S (the k closest instances with respect to the Euclidean distance).
• NN_i^k(x) ⊆ S_i, i = +, −, will be x's k-minority (resp. majority) neighbourhood.
• C_f(x, y) = (C/α) · f(d / d(x, y)), with f(z) = z · 1[z ≤ α] + C · 1[z > α], measures the closeness of x to y, that is, it measures the proximity of borderline instances.
• D_f(x, y) = C_f(x, y) / Σ_{z ∈ V} C_f(z, y) will represent a density factor, such that an instance belonging to a compact cluster will have a higher C_f(z, y) than another one belonging to a sparser cluster.
• I_{α,C}(x, y) = C_f(x, y) · D_f(x, y), where I_{α,C}(x, y) = 0 if x ∉ NN_+^{k3}(y).

Let T_clust := C_clust · (1/|S+_f|) Σ_{x ∈ S+_f} min_{y ∈ S+_f, y ≠ x} d(x, y). We will also use a mean-average agglomerative hierarchical clustering of the minority instances with threshold T_clust, that is, we will use the mean distance

dist(Li, Lj) = (1 / (|Li||Lj|)) Σ_{x ∈ Li} Σ_{y ∈ Lj} d(x, y)

and, having started with one cluster per instance, we will proceed by joining the nearest clusters while the minimum of the distances is lower than T_clust.

A few interesting considerations:

• A low k2 is required in order to ensure we do not pick too many negative instances in U.
• For the opposite reason, a high k3 must be selected to ensure we pick as many positive hard-to-learn borderline examples as we can.
• The higher the C_clust parameter, the fewer and more populated clusters we will get.

B2.1. Pros and cons

The most evident gain of this algorithm is that it fixes some of the weaknesses of SMOTE, and SMOTE is still one of the main references that researchers use as a benchmark to compare their algorithms. That makes MWMOTE a state-of-the-art algorithm. Apart from that, and although the pseudocode can be quite confusing, the idea behind the algorithm is easy to understand.

On the other hand, the algorithm relies on the idea that the space between two minority instances is going to belong to a minority cluster, which seems like a reasonable hypothesis, but can lead to error in certain datasets (e.g., a minority class spread across a large number of tiny clusters).

B3. RACOG and wRACOG

This set of algorithms, proposed in [16], assumes we want to approximate a discrete distribution P(W1, . . . , Wd).

The key of the algorithm is to approximate P(W1, . . . , Wd) as ∏_{i=1}^{d} P(Wi | W_{n(i)}), where n(i) ∈ {1, . . . , d}. Chow–Liu's algorithm is used to meet that purpose. This algorithm minimizes the Kullback–Leibler distance between two distributions:

D_KL(P ∥ Q) = Σ_i P(i) (log P(i) − log Q(i))

We recall the definition of the mutual information of two discrete random variables Wi, Wj:

I(Wi, Wj) = Σ_{w1 ∈ Wi} Σ_{w2 ∈ Wj} p(w1, w2) log( p(w1, w2) / (p(w1) p(w2)) )

Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the unlabeled positive instances. To approximate the distribution, we do the following:

• Compute G = (E′, V′), Chow–Liu's dependence tree.
• If r is the root of the tree, define P(Wr | W_{n(r)}) := P(Wr).
• For each arc (u, v) ∈ E′ in the tree, set n(v) := u and compute P(Wv | W_{n(v)}).

After that, a Gibbs Sampling scheme is used to extract samples with respect to the approximated probability distribution, where a batch of new instances is obtained by performing the following:

• Given a minority sample xi = (w1^(i), . . . , wd^(i)),
• iteratively construct, for each attribute,

w̄k^(i) ∼ P(Wk | w̄1^(i), . . . , w̄_{k−1}^(i), w_{k+1}^(i), . . . , wd^(i))

• Return S′ = {x̄i = (w̄1^(i), . . . , w̄d^(i))}_{i=1}^{m}.

B3.1. RACOG

RACOG (Rapidly Converging Gibbs) builds a Markov chain for each of the m minority instances, ruling out the first β generated instances and selecting a batch of synthetic examples every α iterations. That allows the chains to lose dependence on previous values.

B3.2. wRACOG

RACOG depends on α, β and the requested number of instances. wRACOG (wrapper-based RACOG) tries to overcome that problem. Let wrapper be a binary classifier.
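Since RACOG and wRACOG work on discrete attributes, a continuous dataset has to be discretized before they can be applied. The following sketch illustrates this with the package's racog function; the burnin and lag argument names are assumed to correspond to the β and α parameters of Algorithm 2, and the binning strategy is only illustrative:

# Sketch: applying racog() to a discretized copy of newthyroid1.
# RACOG works on discrete attributes, so numeric columns are binned
# first. The burnin and lag argument names are assumptions mapping to
# the beta (burn-in) and alpha (lag) parameters of Algorithm 2.
library("imbalance")
data(newthyroid1)

discretized <- newthyroid1
numericCols <- names(discretized)[sapply(discretized, is.numeric)]
discretized[numericCols] <- lapply(
  discretized[numericCols],
  function(col) as.integer(cut(col, breaks = 5))  # 5 equal-width bins
)

# Generate 40 synthetic minority instances from the discretized data
newDiscrete <- racog(discretized, numInstances = 40, burnin = 100, lag = 20)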


B3.3. Pros and cons

Clearly, RACOG and wRACOG have the advantage that they are highly based on statistical evidence and procedures, and they have guarantees to succeed in their goal. On the contrary, there exists a substantial downside to those algorithms: they only work on discrete variables, which makes them very restrictive with respect to the set of data they can be applied to.

B4. RWO

RWO (Random Walk Oversampling) is an algorithm introduced by [17], which generates synthetic instances so that the mean and deviation of numerical attributes remain as close as possible to the original ones.

This algorithm is motivated by the central limit theorem, which states that given a collection of independent and identically distributed random variables W1, . . . , Wm, with E(Wi) = μ and Var(Wi) = σ² < ∞, then:

lim_{m→∞} P[ √m · ( (1/m) Σ_{i=1}^{m} Wi − μ ) / σ ≤ z ] = φ(z)

where φ is the distribution function of N(0, 1). That is, (W̄ − μ) / (σ/√m) → N(0, 1) in distribution.

Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the minority instances. Now, let us fix some j ∈ {1, . . . , d}, and let us assume that the j-th column follows a numerical random variable Wj, with mean μj and standard deviation σj < ∞. Let us compute

σ̂_j = √( (1/m) Σ_{i=1}^{m} (w_j^(i))² − ( (1/m) Σ_{i=1}^{m} w_j^(i) )² ),

the biased estimator of the standard deviation. It can be proven that instances generated with w̄_j = w_j^(i) − (σ̂_j / √m) · r, r ∼ N(0, 1), have the same sample mean as the original ones, and their sample variance tends to the original one.

B4.1. Pros and cons

RWO, as it was originally described in [17], uses the sample variance instead of the unbiased sample variance. Therefore we only have the guarantee that E(σ̂_j²) → σ_j² as m → ∞. If we had picked

τ_j = (1/(m−1)) Σ_{i=1}^{m} ( w_j^(i) − (1/m) Σ_{l=1}^{m} w_j^(l) )²

instead, E(τ_j) = σ_j² would hold.

Another downside of the algorithm is its arbitrariness when it comes to non-numerical variables. The most obvious upside of this algorithm is its simplicity and its good practical results.

B5. PDFOS

Due to the complexity of this preprocessing technique, in this case we will structure the description of its working procedure into several subsections, providing the motivation, the background techniques, and finally the pros and cons.

B5.1. Motivation

Given the distribution function of a random variable X, namely F(x), if that function has an almost-everywhere derivative, then, almost everywhere, it holds that

f(x) = F′(x) = lim_{h→0} P(x − h < X ≤ x + h) / (2h)

Given a fixed h, which we will call the bandwidth henceforth, and a collection of random samples X1, . . . , Xn of X, namely x1, . . . , xn, an estimator for f could be the mean number of samples in ]x − h, x + h[ (let us call this number I_h(x1, . . . , xn)) divided by the length of the interval:

f̂(x) = I_h(x1, . . . , xn) / (2hn)

If we define ω(x) = 1/2 if |x| < 1 and 0 otherwise, and ω_h(x) = ω(x/h), then we could write f̂ as:

f̂(x) = (1/(nh)) Σ_{i=1}^{n} ω_h(x − x_i)

In the d-dimensional case, we define:

f̂(x) = (1/(nh^d)) Σ_{i=1}^{n} ω_h(x − x_i)

B5.2. Kernel methods

If we took ω = (1/2) · 1_{]−1,1[}, then f̂ would have jump discontinuities, and so would its derivatives. On the other hand, we could take any ω with ω ≥ 0, ∫ ω(x) dx = 1 and ω even, and that way we could have estimators with more desirable properties with respect to continuity and differentiability.

f̂ can be evaluated through its MISE (Mean Integrated Squared Error):

MISE(h) = E_{x1,...,xn} ∫ ( f̂(x) − f(x) )² dx

B5.3. The algorithm

PDFOS (Probability Distribution density Function estimation based Oversampling) was proposed in [18]. It uses multivariate Gaussian kernel methods. The probability density function of a d-dimensional Gaussian distribution with mean 0 and covariance matrix Σ is:

φ_Σ(x) = (1 / √((2π)^d · det(Σ))) · exp( −(1/2) x Σ^{−1} x^T )

Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the minority instances. The unbiased covariance estimator is:

U = (1/(m−1)) Σ_{i=1}^{m} (xi − x̄)(xi − x̄)^T,  where x̄ = (1/m) Σ_{i=1}^{m} xi

We will use kernel functions φ_h(x) = φ_U(x/h), where h needs to be optimized to minimize the MISE. We will pursue minimizing the following cross-validation function:

M(h) = (1/(m² h^d)) Σ_{i=1}^{m} Σ_{j=1}^{m} φ*_h(xi − xj) + (2/(m h^d)) φ_h(0)

where φ*_h ≈ φ_{h√2} − 2φ_h.

Once a proper h has been found, a suitable generating scheme could be to take xi + hrR, where xi ∈ S+, r ∼ N_d(0, 1) and U = R · R^T, R being upper-triangular. In case we have enough guarantees to decompose U = R · R^T (U must be a positive-definite matrix), we can use the Choleski decomposition. In fact, we provide a sketch of a proof showing that all covariance matrices are positive-semidefinite:

y^T ( Σ_{i=1}^{m} (xi − x̄)(xi − x̄)^T ) y = Σ_{i=1}^{m} ( (xi − x̄)^T y )^T ( (xi − x̄)^T y ) = Σ_{i=1}^{m} ||zi||² ≥ 0,  with zi = (xi − x̄)^T y,

for arbitrary y ∈ R^d. Hence, we need a strictly positive-definite matrix; otherwise PDFOS would not provide a result and will stop its execution.
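The generation scheme just described can be written down directly in R. The following is a standalone illustration of the formula x_i + h·r·R (not the package's internal implementation), using R's chol(), which returns an upper-triangular factor Rc with t(Rc) %*% Rc = U, and Silverman's rule (Section B5.4) as the bandwidth:

# Standalone illustration of the PDFOS generation scheme x_i + h * r * R;
# this is not the package's internal implementation. chol() returns the
# upper-triangular factor Rc such that t(Rc) %*% Rc equals U, so the
# generated points have covariance proportional to U.
pdfosSketch <- function(minority, numInstances) {
  minority <- as.matrix(minority)
  m <- nrow(minority)
  d <- ncol(minority)

  U  <- cov(minority)                       # unbiased covariance estimator
  Rc <- chol(U)                             # requires U positive-definite
  h  <- (4 / (m * (d + 2)))^(1 / (d + 4))   # Silverman's rule (Section B5.4)

  synthetic <- t(replicate(numInstances, {
    x <- minority[sample(m, 1), ]           # pick a random minority instance
    r <- rnorm(d)                           # r ~ N_d(0, 1)
    x + h * as.vector(r %*% Rc)             # x + h * r * R
  }))
  as.data.frame(synthetic)
}

# Hypothetical usage on the numeric attributes of a minority class,
# assuming the class column is named "Class" with label "positive":
# minority  <- subset(newthyroid1, Class == "positive", select = -Class)
# newPoints <- pdfosSketch(minority, 80)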


B5.4. Search of the optimal bandwidth

We take a first approximation to h as the value

h_Silverman = ( 4 / (m(d + 2)) )^{1/(d+4)}

where d is the number of attributes and m the size of the minority class.

Reshaping the equation of the cross-validation function and differentiating:

M(h) = (2/(m² h^d)) Σ_{j>i} φ*_h(xi − xj) + (1/(m h^d)) φ_{h√2}(0)

M′(h) = −(d h^{−1} / (m h^d)) φ_{h√2}(0) − (2/(m² h^d)) Σ_{j>i} φ*_h(xi − xj) d h^{−1} + (2/(m² h^d)) Σ_{j>i} φ*_h(xi − xj) h^{−3} (xi − xj)^T U (xi − xj)

And we use a straightforward gradient descent algorithm to find a good estimation of h.

B5.5. Pros and cons

On the one hand, PDFOS makes the assumption that the data can be locally approximated by a normal distribution. What is more, it makes the assumption that the same bandwidth gives good local results at every single point. Another disadvantage of the algorithm is that not every covariance matrix has a Choleski decomposition (a covariance matrix can be shown to be positive-semidefinite, whereas for the Choleski decomposition to exist it needs to be a positive-definite matrix).

On the other hand, although it makes some hypotheses, they are mild assumptions compared to the results it yields. It also has an enormous theoretical component, which ensures quality results.

B6. NEATER

Once we have created synthetic examples, we should ask ourselves how many of those instances are in fact relevant to our problem. Filtering algorithms can be applied to oversampled datasets to erase the least relevant instances.

B6.1. Game theory

Let (P, T, f) be our game space. We have a set of players P = {1, . . . , n}, and Ti = {1, . . . , ki}, the set of feasible strategies for the ith player, resulting in T = T1 × ··· × Tn. We can easily assign a payoff to each player taking into account his/her own strategy as well as the other players' strategies. So f is given by the following equation:

f : T → R^n,  t ↦ (f1(t), . . . , fn(t))

t_{−i} will denote (t1, . . . , t_{i−1}, t_{i+1}, . . . , tn) and, similarly, we can denote fi(ti, t_{−i}) = fi(t).

A strategic Nash equilibrium is a tuple (t1*, . . . , tn*) where fi(ti*, t*_{−i}) ≥ fi(ti, t*_{−i}) for every other t ∈ T and all i = 1, . . . , n. That is, a strategic Nash equilibrium maximizes the payoff for all the players.

The strategy for each player will be picked with respect to a given probability:

δi ∈ Δi = { (δi^(1), . . . , δi^(ki)) ∈ (R+_0)^{ki} : Σ_{j=1}^{ki} δi^(j) = 1 }

We define Δ := Δ1 × ··· × Δn and we call an element δ = (δ1, . . . , δn) ∈ Δ a strategy profile. Having fixed a strategy profile δ, the overall payoff for the ith player is defined as:

ui(δ) = Σ_{t = (t1,...,tn) ∈ T} δi^(ti) fi(t)

Given ui, the payoff for a strategy profile δ ∈ Δ for the i-th player, we will denote

δ_{−i} := (δ1, . . . , δ_{i−1}, δ_{i+1}, . . . , δn)   (B.1)

ui(δi, δ_{−i}) := ui(δ)   (B.2)

A probabilistic Nash equilibrium is a strategy profile δ* = (δ1*, . . . , δn*) verifying ui(δi*, δ*_{−i}) ≥ ui(δi, δ*_{−i}) for every other δ ∈ Δ and all i = 1, . . . , n.

A theorem ensures that every game space (P, T, f) with finite players and strategies has a probabilistic Nash equilibrium.

B6.2. The algorithm

NEATER (filteriNg of ovErsampled dAta using non cooperaTive gamE theoRy), introduced in [19], is a filtering algorithm based on game theory. Let S be the original training set and E the synthetic generated instances. Our players are S ∪ E. Every player is able to pick between two different strategies: being a negative instance (0) or being a positive instance (1). Players of S have a fixed strategy, where the ith player has δi = (0, 1) (a 0 strategy) in case it is a negative instance, or δi = (1, 0) (a 1 strategy) otherwise.

The payoff for a given instance is affected only by its own strategy and its k nearest neighbors in S ∪ E. That is, for every xi ∈ E we will have ui(δ) = Σ_{xj ∈ NN^k(xi)} δi^T w_{ij} δj, where w_{ij} = g(d(xi, xj)) and g is a decreasing function (greater distances imply a lower payoff). In our implementation, we have considered g(z) = 1/(1 + z²), with d the Euclidean distance.

Each step involves an update of the strategy profiles of the instances of E. Namely, if xi ∈ E, the following equations are used:

δi(0) = (1/2, 1/2),
δ_{i,1}(n + 1) = [ (α + ui((1, 0))) / (α + ui(δ(n))) ] · δ_{i,1}(n),
δ_{i,2}(n + 1) = 1 − δ_{i,1}(n + 1)

That is, we reinforce the strategy that is producing the higher payoff, to the detriment of the opposite strategy. This method has enough convergence guarantees.

References

[1] A. Fernández, S. del Río, N.V. Chawla, F. Herrera, An insight into imbalanced big data classification: outcomes and challenges, Complex Intell. Syst. 3 (2) (2017) 105–120, doi:10.1007/s40747-017-0037-9.
[2] Q. Zou, S. Xie, Z. Lin, M. Wu, Y. Ju, Finding the best classification threshold in imbalanced classification, Big Data Res. 5 (2016) 2–8, doi:10.1016/j.bdr.2015.12.001.
[3] B. Krawczyk, M. Woźniak, G. Schaefer, Cost-sensitive decision tree ensembles for effective imbalanced classification, Appl. Soft Comput. 14 (2014) 554–562, doi:10.1016/j.asoc.2013.08.014.
[4] P. Zhou, X. Hu, P. Li, X. Wu, Online feature selection for high-dimensional class-imbalanced data, Knowl. Based Syst. 136 (2017) 187–199, doi:10.1016/j.knosys.2017.09.006.
[5] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284, doi:10.1109/tkde.2008.239.
[6] I. Triguero, S. González, J.M. Moyano, S. García, J. Alcalá-Fdez, J. Luengo, A. Fernández, M.J. del Jesús, L. Sánchez, F. Herrera, KEEL 3.0: an open source software for multi-stage analysis in data mining, Int. J. Comput. Intell. Syst. 10 (1) (2017) 1238, doi:10.2991/ijcis.10.1.82.
[7] G. Lemaitre, F. Nogueira, C.K. Aridas, Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning, J. Mach. Learn. Res. 18 (2017) 1–5.

[8] A.D. Pozzolo, O. Caelen, S. Waterschoot, G. Bontempi, Racing for unbalanced methods selection, in: H. Yin, K. Tang, Y. Gao, F. Klawonn, M. Lee, T. Weise, B. Li, X. Yao (Eds.), Lecture Notes in Computer Science, 8206, Springer, 2013, pp. 24–31.
[9] W. Siriseriwan, Smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE, 2018, https://cran.r-project.org/package=smotefamily.
[10] N. Lunardon, G. Menardi, N. Torelli, ROSE: a package for binary imbalanced learning, R J. (2014) 82–92.
[11] G. Menardi, N. Torelli, Training and assessing classification rules with imbalanced data, Data Min. Knowl. Discov. 28 (2014) 92–122, doi:10.1007/s10618-012-0295-5.
[12] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357, doi:10.1613/jair.953.
[13] A. Fernandez, S. Garcia, F. Herrera, N.V. Chawla, SMOTE for learning from imbalanced data: progress and challenges. Marking the 15-year anniversary, J. Artif. Intell. Res. 61 (2018) 863–905.
[14] S. Das, S. Datta, B.B. Chaudhuri, Handling data irregularities in classification: foundations, trends, and future challenges, Pattern Recognit. 81 (2018) 674–693, doi:10.1016/j.patcog.2018.03.008.
[15] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 405–425, doi:10.1109/tkde.2012.232.
[16] B. Das, N.C. Krishnan, D.J. Cook, RACOG and wRACOG: two probabilistic oversampling techniques, IEEE Trans. Knowl. Data Eng. 27 (1) (2015) 222–234, doi:10.1109/tkde.2014.2324567.
[17] H. Zhang, M. Li, RWO-sampling: a random walk over-sampling approach to imbalanced data classification, Inf. Fusion 20 (2014) 99–116, doi:10.1016/j.inffus.2013.12.003.
[18] M. Gao, X. Hong, S. Chen, C.J. Harris, E. Khalaf, PDFOS: PDF estimation based over-sampling for imbalanced two-class problems, Neurocomputing 138 (2014) 248–259, doi:10.1016/j.neucom.2014.02.006.
[19] B.A. Almogahed, I.A. Kakadiaris, NEATER: filtering of over-sampled data using non-cooperative game theory, Soft Comput. 19 (11) (2014) 3301–3322, doi:10.1109/ICPR.2014.245.
[20] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Log. Soft Comput. 17 (2010) 255–287.
