

Imbalance: Oversampling algorithms for imbalanced classification in R


Ignacio Cordón, Salvador García∗, Alberto Fernández, Francisco Herrera
DaSCI Andalusian Institute of Data Science and Computational Intelligence, University of Granada, Spain

Article info

Article history: Received 1 March 2018; Revised 14 June 2018; Accepted 25 July 2018; Available online xxx.

Keywords: Oversampling; Imbalanced classification; Machine learning; Preprocessing; SMOTE.

∗ Corresponding author. E-mail address: [email protected] (S. García).

https://doi.org/10.1016/j.knosys.2018.07.035

Abstract

Addressing imbalanced datasets in classification tasks is a relevant topic in research studies. The main reason is that for standard classification algorithms, the success rate when identifying minority class instances may be adversely affected. Among different solutions to cope with this problem, data level techniques have shown a robust behavior. In this paper, the novel imbalance package is introduced. Written in R and C++, and available at the CRAN repository, this library includes recent relevant oversampling algorithms to improve the quality of data in imbalanced datasets, prior to performing a learning task. The main features of the package, as well as some illustrative examples of its use, are detailed throughout this manuscript.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

The imbalance classification problem is probably one of the most researched problems in the machine learning framework [1–5]. Its classical definition is a binary classification problem where we are given a set of training instances, labeled with two possible classes, and a set of unlabeled instances, namely the test set, to classify using the information provided by the former. When the size of one class, which usually represents the most important concept to predict, is much lower than that of the other one, we have an imbalanced classification problem.

In these cases, standard classification learning algorithms may be biased toward the majority class examples. In order to address this issue, a common approach is to apply a preprocessing stage for rebalancing the training data. This can be carried out either by undersampling, i.e., removing majority class instances, or oversampling, i.e., introducing new minority class instances. Since undersampling may remove some relevant instances, oversampling is usually preferred. Additionally, this procedure is intended to reinforce the concept represented by the minority class, so that the learning algorithm will be guided to avoid misclassifying these examples.

Plenty of excellent oversampling algorithms arise every day in the scientific literature, but the software is rarely released. To the best of our knowledge, there are just a few open source libraries and packages that include methods and techniques related to imbalanced classification.

First, regarding Java, we may find a complete module for imbalanced classification in the KEEL software suite [6]. It comprises a very complete collection of external and internal approaches, as well as a large number of ensemble methods that work at both levels.

For Python, there exists a very recent toolbox named imbalanced-learn [7]. Similar to KEEL, it includes solutions based on preprocessing and ensemble learning.

Finally, for R we may find several packages at CRAN which include oversampling and undersampling methods. Specifically, we must refer to unbalanced [8], smotefamily [9], and rose [10,11].

However, there are two main issues associated with the aforementioned software solutions. On the one hand, only imbalanced-learn allows a straightforward representation of the preprocessed datasets. This fact is very important in order to acknowledge the actual areas that are reinforced for the minority class examples. On the other hand, and possibly the most significant point, we have observed that none of them contain the latest approaches proposed in the specialized literature.

In this paper, we present a novel, robust and up-to-date R package including preprocessing techniques for imbalanced classification. Named imbalance, it aims to provide both the state-of-the-art oversampling methods and some of the most recent techniques which still lack an implementation in the R language. In this sense, we intend to provide a significant contribution to the already available tools that address the same problem.

To present this novel package, the rest of the manuscript is arranged as follows. First, Section 2 presents the software framework and enumerates the implemented algorithms. Then, Section 3 shows some illustrative examples. Finally, Section 4 presents the conclusions.


Fig. 1. PDFOS applied to newthyroid1 dataset.

Table 1
Comparison of the proposed imbalance package to the available R packages for imbalanced classification.

Property           imbalance    unbalanced   smotefamily   rose
Version            1.0.0        2.0          1.2           0.0-3
Date               2018-02-18   2015-06-26   2018-01-30    2014-07-15
#Techniques        12           9            6             1
Undersampling      ✗            √            ✗             ✗
Oversampling       √            √            √             √
SMOTE (& var.)     √            √            √             ✗
Advanced OverS.    √            ✗            ✗             ✗
Filtering          √            √            ✗             ✗
Wrapper            √            √            ✗             ✗
Visualization      √            ✗            ✗             ✗


Fig. 2. PDFOS + NEATER applied to newthyroid1.

Table 2
Code metadata.

Nr.   Code metadata description                                            Value
C1    Current code version                                                 1.0.0
C2    Permanent link to code/repository used of this code version          github.com/ncordon/imbalance
C3    Legal Code License                                                   GPL (≥ 2)
C4    Code versioning system used                                          git
C5    Software code languages, tools, and services used                    R (≥ 3.3.0), C++
C6    Compilation requirements, operating environments and dependencies    Rcpp, bnlearn, KernelKnn, ggplot2, mvtnorm
C7    If available, link to developer documentation/manual                 ncordon.github.io/imbalance
C8    Support email for questions                                          [email protected]

2. Software

The significance of oversampling techniques in imbalanced classification is beyond all doubt. The main reason is related to the smart generation of new artificial minority samples in those areas that need reinforcement for the learning of class-fair classifiers.


Fig. 3. Original banana dataset.

Fig. 4. Imbalanced banana dataset.

Fig. 5. SMOTE applied to imbalanced banana.

Fig. 6. MWMOTE applied to imbalanced banana.

Fig. 7. RWO applied to imbalanced banana.

Fig. 8. PDFOS applied to imbalanced banana.

Since the original SMOTE algorithm in 2002 [12], many different approaches have been designed to improve the classification performance under different scenarios [13,14]. In this newly developed software package, we have compiled some of the newest oversampling algorithms, which are listed below:

• mwmote. The Majority Weighted Minority Oversampling Technique (MWMOTE), first proposed in [15], is an extension of the original SMOTE algorithm [12,13]. It assigns higher weight to borderline instances, undersized minority clusters and examples near the borderline of the two classes.
• racog, wracog. Rapidly Converging Gibbs (RACOG) and wrapper-based RACOG (wRACOG), both proposed by [16], work for discrete attributes. They generate new examples with respect to an approximated distribution using a Gibbs Sampler scheme. RACOG needs the number of instances to generate beforehand. wRACOG requires a target classifier to show no improvement in order to stop generating examples.
• rwo. Random Walk Oversampling (RWO) is an algorithm introduced by [17], which generates synthetic instances so that the mean and deviation of numerical attributes remain close to the original ones.
• pdfos. Probability Distribution density Function estimation based Oversampling (PDFOS) was proposed in [18]. It uses multivariate Gaussian kernel methods to locally approximate the minority class.

Apart from those oversampling methods, we provide a filtering method called neater. The filteriNg of ovErsampled dAta using non cooperaTive gamE theoRy (NEATER), introduced in [19], is highly based on game theory. It discards the instances with a higher probability of belonging to the opposite class, based on each instance's neighborhood.
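Each of the generators above is exposed as an R function of the same name. The following is a minimal sketch, assuming the other generators share the numInstances interface that the pdfos example of Section 3 uses:

# Minimal sketch: two of the generators listed above applied to the
# newthyroid1 dataset bundled with the package. The numInstances
# argument mirrors the pdfos() call shown in Section 3; it is assumed
# here that the remaining generators accept the same argument name.
library("imbalance")
data(newthyroid1)

extraMwmote <- mwmote(newthyroid1, numInstances = 40)
extraRwo    <- rwo(newthyroid1, numInstances = 40)

# Each call returns a data frame of synthetic minority examples that can
# be appended to the original data with rbind(), as in Section 3.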


The package also includes the method oversample, which is a wrapper that eases calls to the described and already existing methods.

To evaluate the oversampling process, we propose a visual method, called plotComparison. It plots a pairwise comparative grid of a selected set of attributes, both in the original dataset and the oversampled one. That way, if a proper oversampling has been performed, we expect to see larger minority clusters in the resulting dataset.

In addition to this, imbalance includes some datasets from the KEEL [6,20] repository (http://www.keel.es/datasets.php), which can be used to perform experiments. Additional datasets can be easily imported under a single constraint: they must contain a class column (not necessarily the last one) having two different values.

To conclude this section, we show in Table 1 a comparison of the main features of this novel imbalance package with respect to the previous software solutions available at CRAN: unbalanced [8], smotefamily [9], and rose [10,11]. Among the properties that are contrasted, we include the latest date of release, the number of preprocessing techniques that are available, and whether each package includes different approaches, namely undersampling, oversampling, SMOTE (and variants/extensions), advanced oversampling (techniques beyond SMOTE), and filtering methods. We also show whether an automatic wrapper procedure is available, and whether it includes a visualization of the preprocessed output.

From Table 1 we may conclude that imbalance is a complete solution with many relevant oversampling techniques. It is by far the one with the largest number of approaches, and the only one that includes oversampling approaches beyond the traditional SMOTE scheme. Finally, we must stress the relevance of the visualization feature, which can be very useful for practitioners to check the areas where the minority class is mainly reinforced (Table 2).

3. Examples of use

The following example loads the dataset newthyroid1, included in our package, and applies the algorithm PDFOS to the dataset, requesting 80 new instances. newthyroid1 is a classical dataset that has a series of 5 different medical measurements as attributes, and classifies every patient as hyperthyroidism (42 instances) or non-hyperthyroidism (173 instances). Once the algorithm has been applied, we plot a pairwise visual comparison between the first three attributes of the original and modified datasets. The result can be observed in Fig. 1, and in Fig. 2 after applying filtering.


# Load the previously installed imbalance package
library("imbalance")

# Load the dataset newthyroid1, included in imbalance
data(newthyroid1)

# Compute the imbalance ratio of newthyroid1 (that is, the proportion
# of minority examples with respect to majority ones)
imbalanceRatio(newthyroid1)
# 0.1944444

# Generate 80 new minority instances for newthyroid1 using the
# pdfos algorithm
newSamples <- pdfos(newthyroid1, numInstances = 80)

# Add the new samples to the original newthyroid1 dataset and
# assign the result to a newDataset variable
newDataset <- rbind(newthyroid1, newSamples)

# Compare the three first variables of the extended dataset
# and the original one
plotComparison(newthyroid1, newDataset, attrs = names(newthyroid1)[1:3])

# Filter the synthetic examples in newSamples, so that only
# relevant ones remain in the dataset
filteredSamples <- neater(newthyroid1, newSamples, iterations = 500)

# Add the new filtered samples to the original dataset
filteredNewDataset <- rbind(newthyroid1, filteredSamples)

# Compare the three first variables of the extended filtered
# dataset and the original one
plotComparison(newthyroid1, filteredNewDataset, attrs = names(newthyroid1)[1:3])
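The same pipeline can presumably be condensed with the oversample wrapper introduced in Section 2. The paper does not show its argument names, so the ratio, method and filtering parameters below, and the assumption that the wrapper returns the whole rebalanced dataset, are illustrative guesses only:

# Hypothetical condensed version of the example above using the
# oversample() wrapper from Section 2. The argument names ratio,
# method and filtering are assumptions, not taken from the paper,
# and oversample() is assumed to return the full rebalanced dataset.
balancedThyroid <- oversample(
  newthyroid1,
  ratio = 0.8,        # assumed: target imbalance ratio after oversampling
  method = "PDFOS",   # assumed: identifier of the oversampling technique
  filtering = TRUE    # assumed: apply NEATER to the generated instances
)

# Visual check of the rebalanced data, as before
plotComparison(
  newthyroid1,
  balancedThyroid,
  attrs = names(newthyroid1)[1:3]
)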


The banana dataset, which has been included in the package, is a binary dataset with 5300 samples and two attributes (apart from the class attribute), artificially developed to represent a banana shape when plotted in two dimensions, with each class filled with a different color. A straightforward random undersampling has yielded an imbalanced dataset (with a 10% imbalance ratio) from the original dataset, both observable in Figs. 3 and 4, respectively. The original dataset has been included as banana-orig in the package; the imbalanced one, as banana. We provide a visual comparison between the results of applying oversampling to reach a 50% imbalance ratio in banana, followed by a later filtering using NEATER. The oversampling techniques applied are SMOTE (Fig. 5), MWMOTE (Fig. 6), RWO (Fig. 7) and PDFOS (Fig. 8).
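A sketch of how these comparisons could be reproduced is shown below. The number of synthetic instances needed to move banana from a 10% to a 50% imbalance ratio is derived from the class counts; the column name Class and the attribute positions are assumptions about how the bundled dataset is laid out:

# Sketch: reproducing one of the comparisons of Figs. 5-8. The class
# column name ("Class") and the attribute positions are assumptions
# about the bundled banana dataset.
library("imbalance")
data(banana)

counts <- table(banana$Class)
nMin <- min(counts)
nMaj <- max(counts)

# imbalance ratio = minority / majority; to reach a 0.5 ratio we need
# 0.5 * nMaj minority examples in total, hence this many new ones:
numNeeded <- ceiling(0.5 * nMaj - nMin)

newSamples <- mwmote(banana, numInstances = numNeeded)  # or rwo(), pdfos()
filtered   <- neater(banana, newSamples, iterations = 100)

plotComparison(
  banana,
  rbind(banana, filtered),
  attrs = names(banana)[1:2]   # assumed: the two numeric attributes
)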
Algorithm 1 MWMOTE oversampling.
Require: S+ = {x1, . . . , xm}, minority instances
Require: S− = {y1, . . . , ym′}, majority instances
Require: T, requested number of synthetic examples
Require: k1, KNN parameter to filter noisy instances of S+
Require: k2, KNN parameter to compute the boundary U ⊆ S−
Require: k3, KNN parameter to compute the boundary V ⊆ S+
Require: α, tolerance for the closeness level to the borderline
Require: C, weight of the closeness factor to the borderline
Require: Cclust
1: Initialize S′ = ∅
2: For each x ∈ S+, compute its k1 KNN neighbourhood, NN^{k1}(x)
3: Let S+_f = S+ − {x ∈ S+ : NN^{k1}(x) ∩ S+ = ∅}
4: Compute U = ∪_{x ∈ S+_f} NN−^{k2}(x)
5: Compute V = ∪_{x ∈ U} NN+^{k3}(x)
6: For each x ∈ V, compute P(x) = Σ_{y ∈ U} I_{α,C}(x, y)
7: Normalize P(x) for each x ∈ V: P(x) = P(x) / Σ_{z ∈ V} P(z)
8: Compute T_clust = C_clust · (1 / |S+_f|) Σ_{x ∈ S+_f} min_{y ∈ S+_f, y ≠ x} d(x, y)
9: Let L1, . . . , LM ⊆ S+ be the clusters for S+, with T_clust as threshold
10: for t = 1, . . . , T do
11:   Pick x ∈ V with respect to P(x)
12:   Uniformly pick y inside Le, where Le is the cluster with x ∈ Le
13:   Uniformly pick r ∈ [0, 1]
14:   S′ = S′ ∪ {x + r(y − x)}
15: end for
16: return S′, synthetic examples

Algorithm 2 RACOG oversampling.
Require: S = {x1, . . . , xm}, positive examples
Require: β, burn-in
Require: α, lag
Require: T, requested number of synthetic examples
1: P = approximation of the S distribution
2: S′ = ∅
3: M = ⌈T/m⌉ · α + β
4: for t = 1, . . . , M do
5:   S′′ = GibbsSampler(S, P)
6:   if t > β and t mod α = 0 then
7:     S′ = S′ ∪ S′′
8:   end if
9: end for
10: S′ = Pick T random instances from S′
11: return S′, synthetic examples

Algorithm 3 wRACOG oversampling.
Require: Strain = {zi = (xi, yi)}_{i=1}^{m}, train instances
Require: Sval, validation instances
Require: wrapper, a binary classifier
Require: T, size of the sliding window τ of sensitivities
Require: α, tolerance parameter
1: S = Strain
2: P = approximation of the S distribution
3: Build a model with wrapper and Strain
4: Initialize S′ = ∅
5: Initialize τ = (+∞, . . . , +∞)
6: while the standard deviation of τ ≥ α do
7:   S′′ = GibbsSampler(S, P)
8:   Let Smisc ⊆ S′′ be the examples misclassified by the model
9:   Update S′ = S′ ∪ Smisc
10:  Update the train set Strain = Strain ∪ Smisc
11:  Build a new model with wrapper and Strain
12:  Let s = sensitivity of the model over Sval
13:  Let τ = (τ2, . . . , τT, s)
14: end while
15: return S′, synthetic examples

4. Conclusions

Class imbalance in datasets is one of the most decisive factors to take into account when performing classification tasks. To improve the behavior of classification algorithms in imbalanced domains, one of the most common and effective approaches is to apply oversampling as a preprocessing step. It works by creating synthetic minority instances to increase the number of representatives belonging to that class.

In this paper we have presented the imbalance package for R. It was intended to alleviate some drawbacks that arise in current software solutions. Firstly, to provide useful implementations of those novel oversampling methods that were not yet available for researchers and practitioners. Secondly, to include a visualization environment for the sake of observing those areas of the minority class that are actually reinforced by means of preprocessing. Finally, to enable a simpler integration of these methods with the existing oversampling packages at CRAN.

As future work, we propose to keep maintaining and adding functionality to our new imbalance package. Specifically, we plan to include those new oversampling techniques that are regularly proposed in the specialized literature. In this sense, we consider that there are good prospects to improve the software in the near future.

Acknowledgments

This work is supported by the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.


Algorithm 4 RWO oversampling.
Require: S = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m}, positive instances
Require: T, required number of instances
1: Initialize S′ = ∅
2: for each j = 1, . . . , d do
3:   if the j-th attribute is numerical then
4:     σ̂_j = √( (1/m) Σ_{i=1}^{m} (w_j^(i))² − ( (1/m) Σ_{i=1}^{m} w_j^(i) )² )
5:   end if
6: end for
7: Assign M = ⌈T/m⌉
8: for t = 1, . . . , M do
9:   for i = 1, . . . , m do
10:    for j = 1, . . . , d do
11:      if the j-th attribute is numerical then
12:        Choose r ∼ N(0, 1)
13:        w_j = w_j^(i) − (σ̂_j / √m) · r
14:      else
15:        Choose w_j uniformly over {w_j^(1), . . . , w_j^(m)}
16:      end if
17:    end for
18:    S′ = S′ ∪ {(w1, . . . , wd)}
19:  end for
20: end for
21: S′ = Choose T random instances from S′
22: return S′, synthetic positive instances

Algorithm 5 PDFOS oversampling.
Require: S = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m}, positive instances
Require: T, required number of instances
1: Initialize S′ = ∅
2: Search for the h which minimizes M(h)
3: Find U, the unbiased covariance matrix of S
4: Compute U = R · R^T with the Choleski decomposition
5: for i = 1, . . . , T do
6:   Choose x ∈ S
7:   Pick r with respect to a normal distribution, i.e. r ∼ N_d(0, 1)
8:   S′ = S′ ∪ {x + hrR}
9: end for
10: return S′, synthetic positive instances

Algorithm 6 NEATER filtering.
Require: S = {z1 = (x1, y1), . . . , zn = (xn, yn)}, original dataset
Require: S′ = {z̄1 = (x̄1, ȳ1), . . . , z̄m = (x̄m, ȳm)}, synthetic positive instances
Require: k, number of KNN neighbours
Require: T, required number of iterations
Require: α, smooth factor
1: Initialize E = ∅
2: For each x̄i ∈ S′, compute its neighbourhood NN^k(x̄i) ⊆ S ∪ S′
3: For i = 1, . . . , n initialize δi = (1, 0) if yi = 1 and δi = (0, 1) if yi = −1
4: For i = n + 1, . . . , n + m initialize δi = (0.5, 0.5)
5: for t = 1, . . . , T do
6:   for i = 1, . . . , m do
7:     Compute the total payoff: ui = Σ_{xj ∈ NN^k(x̄i)} g(d(x̄i, xj)) · δ_{n+i} · δj^T
8:     Compute the positive payoff: u = Σ_{xj ∈ NN^k(x̄i)} g(d(x̄i, xj)) · (1, 0) · δj^T
9:     Assign w = (α + u) / (α + ui)
10:    Update δ_{n+i} = (w · δ_{n+i,1}, 1 − w · δ_{n+i,1})
11:  end for
12: end for
13: for i = 1, . . . , m do
14:   if δ_{n+i,1} > 0.5 then
15:     E = E ∪ {(x̄i, 1)}
16:   end if
17: end for
18: return E ⊆ S′, filtered synthetic positive instances

Appendix A. Installation

To install our package, the R language is needed (see https://www.r-project.org/ for further indications on how to install it). Once R is properly installed, it suffices to run, in an R interpreter:

install.packages("imbalance")

This command will install the latest version of the package directly from CRAN, which is the official repository for R packages.

Appendix B. Description of the algorithms in the package

Hereafter, the new preprocessing techniques included in the imbalance package will be further described. To do so, Section B.1 first provides a short introduction to the classification task. Then, Sections B.2–B.6 include the main characteristics of the algorithms together with an explanation of the upsides and downsides of every oversampling approach.

B1. Classification task

The classification problem is one of the best known problems in the machine learning framework. We are given a set of training instances, namely S = {(x1, y1), . . . , (xm, ym)} where xi ∈ X ⊂ R^n and yi ∈ Y, with Y = {0, 1}. X and Y will be called the domain and label set, respectively. The training set will be considered as independent and identically distributed (i.i.d.) samples taken with respect to an unknown probability P over the set X × Y, denoting that as S ∼ P^m.


A test set will be a set of i.i.d. instances T = {(x̄1, ȳ1), . . . , (x̄k, ȳk)}, (x̄i, ȳi) ∈ X × Y, taken with respect to P. We denote Tx := {x̄1, . . . , x̄k}. A classifier hS (depending on the training set S) is a function that takes an arbitrary test set, lacking the labels ȳi, and outputs labels for each instance. That is, hS(Tx) = {(x̄1, ŷ1), . . . , (x̄k, ŷk)}. The aim of the classification problem is to find the classifier that minimizes the labeling error LP with respect to the true distribution of the data, defined as follows:

LP(hS) := E_{(x,y)∼P} 1[hS({x}) ≠ {(x, y)}]

where 1[Condition] returns 1 if Condition holds, and 0 otherwise. Hence, LP represents an average of the error over the domain instances, weighted by the probability of extracting those instances.

Since we do not know the true distribution of the data, we will usually approximate LP(hS) using the average error over a test set T = {(x̄1, ȳ1), . . . , (x̄k, ȳk)}:

L(hS) = (1/k) Σ_{i=1}^{k} 1[hS({x̄i}) ≠ {(x̄i, ȳi)}]

Specifically, the imbalance package provides oversampling algorithms. This family of procedures aims to generate a set E of synthetic positive instances based on the training ones, so that we have a new classification problem with S̄+ = S+ ∪ E, S̄− = S− and S̄ = S̄+ ∪ S̄− as our new training set.

B2. MWMOTE

This algorithm, proposed by [15], is one of the many modifications of SMOTE [12], which is a classic algorithm to treat class imbalance. SMOTE generates new examples by filling empty areas among the positive instances. It updates the training set iteratively, by performing:

E := E ∪ {x + r · (y − x)},  x, y ∈ S+,  r ∼ N(0, 1)

But SMOTE has a clear downside: it does not detect noisy instances. Therefore, it can generate synthetic examples out of noisy ones, or even between two minority clusters, which, if not cleansed up, may end up becoming noise inside a majority class cluster. MWMOTE (Majority Weighted Minority Oversampling Technique) tries to overcome both problems. It intends to give higher weight to borderline instances, undersized minority clusters and examples near the borderline of the two classes.

Let us introduce some notation and definitions. Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the unlabeled positive instances.

• d(x, y) stands for the Euclidean distance between x and y.
• NN^k(x) ⊆ S will be the k-neighbourhood of x in S (the k closest instances with respect to the Euclidean distance).
• NN_i^k(x) ⊆ S_i, i = +, −, will be x's k-minority (resp. majority) neighbourhood.
• C_f(x, y) = (C/α) · f(d / d(x, y)), with f(z) = z · 1[z ≤ α] + C · 1[z > α], measures the closeness of x to y, that is, it measures the proximity of borderline instances.
• D_f(x, y) = C_f(x, y) / Σ_{z ∈ V} C_f(z, y) will represent a density factor, such that an instance belonging to a compact cluster will have a higher C_f(z, y) than another one belonging to a sparser cluster.
• I_{α,C}(x, y) = C_f(x, y) · D_f(x, y), where I_{α,C}(x, y) = 0 if x ∉ NN_+^{k3}(y).

Let T_clust := C_clust · (1/|S+_f|) Σ_{x ∈ S+_f} min_{y ∈ S+_f, y ≠ x} d(x, y). We will also use a mean-average agglomerative hierarchical clustering of the minority instances with threshold T_clust, that is, we will use the mean distance

dist(Li, Lj) = (1 / (|Li||Lj|)) Σ_{x ∈ Li} Σ_{y ∈ Lj} d(x, y)

and, having started with one cluster per instance, we will proceed by joining the nearest clusters while the minimum of the distances is lower than T_clust.

A few interesting considerations:

• A low k2 is required in order to ensure we do not pick too many negative instances in U.
• For the opposite reason, a high k3 must be selected to ensure we pick as many positive hard-to-learn borderline examples as we can.
• The higher the C_clust parameter, the fewer and more populated clusters we will get.

B2.1. Pros and cons

The most evident gain of this algorithm is that it fixes some of the weaknesses of SMOTE, and SMOTE is still one of the main references that researchers use as a benchmark to compare their algorithms. That makes MWMOTE a state-of-the-art algorithm. Apart from that, and although the pseudocode can be quite confusing, the idea behind the algorithm is easy to understand.

On the other hand, the algorithm relies on the idea that the space between two minority instances is going to belong to a minority cluster, which seems like a reasonable hypothesis, but can lead to error in certain datasets (e.g., a minority class spread across a large number of tiny clusters).

B3. RACOG and wRACOG

This set of algorithms, proposed in [16], assumes we want to approximate a discrete distribution P(W1, . . . , Wd).

The key of the algorithm is to approximate P(W1, . . . , Wd) as ∏_{i=1}^{d} P(Wi | W_{n(i)}), where n(i) ∈ {1, . . . , d}. Chow–Liu's algorithm is used to meet that purpose. This algorithm minimizes the Kullback–Leibler distance between two distributions:

D_KL(P ∥ Q) = Σ_i P(i) (log P(i) − log Q(i))

We recall the definition of the mutual information of two discrete random variables Wi, Wj:

I(Wi, Wj) = Σ_{w1 ∈ Wi} Σ_{w2 ∈ Wj} p(w1, w2) log( p(w1, w2) / (p(w1) p(w2)) )

Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the unlabeled positive instances. To approximate the distribution, we do the following:

• Compute G = (E′, V′), Chow–Liu's dependence tree.
• If r is the root of the tree, define P(Wr | W_{n(r)}) := P(Wr).
• For each arc (u, v) ∈ E′ in the tree, set n(v) := u and compute P(Wv | W_{n(v)}).

After that, a Gibbs Sampling scheme is used to extract samples with respect to the approximated probability distribution, where a batch of new instances is obtained by performing the following:

• Given a minority sample xi = (w1^(i), . . . , wd^(i)),
• iteratively construct, for each attribute,

w̄k^(i) ∼ P(Wk | w̄1^(i), . . . , w̄_{k−1}^(i), w_{k+1}^(i), . . . , wd^(i))

• Return S′ = {x̄i = (w̄1^(i), . . . , w̄d^(i))}_{i=1}^{m}.

B3.1. RACOG

RACOG (Rapidly Converging Gibbs) builds a Markov chain for each of the m minority instances, ruling out the first β generated instances and selecting a batch of synthetic examples every α iterations. That allows the chains to lose dependence on previous values.

B3.2. wRACOG

RACOG depends on α, β and the requested number of instances. wRACOG (wrapper-based RACOG) tries to overcome that problem. Let wrapper be a binary classifier.
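Since RACOG and wRACOG work on discrete attributes, a continuous dataset has to be discretized before they can be applied. The following sketch illustrates this with the package's racog function; the burnin and lag argument names are assumed to correspond to the β and α parameters of Algorithm 2, and the binning strategy is only illustrative:

# Sketch: applying racog() to a discretized copy of newthyroid1.
# RACOG works on discrete attributes, so numeric columns are binned
# first. The burnin and lag argument names are assumptions mapping to
# the beta (burn-in) and alpha (lag) parameters of Algorithm 2.
library("imbalance")
data(newthyroid1)

discretized <- newthyroid1
numericCols <- names(discretized)[sapply(discretized, is.numeric)]
discretized[numericCols] <- lapply(
  discretized[numericCols],
  function(col) as.integer(cut(col, breaks = 5))  # 5 equal-width bins
)

# Generate 40 synthetic minority instances from the discretized data
newDiscrete <- racog(discretized, numInstances = 40, burnin = 100, lag = 20)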


B3.3. Pros and cons

Clearly, RACOG and wRACOG have the advantage that they are highly based on statistical evidence and procedures, and they have guarantees to succeed in their goal. On the contrary, there exists a substantial downside to those algorithms: they only work on discrete variables, which makes them very restrictive with respect to the set of data they can be applied to.

B4. RWO

RWO (Random Walk Oversampling) is an algorithm introduced by [17], which generates synthetic instances so that the mean and deviation of numerical attributes remain as close as possible to the original ones.

This algorithm is motivated by the central limit theorem, which states that given a collection of independent and identically distributed random variables W1, . . . , Wm, with E(Wi) = μ and Var(Wi) = σ² < ∞, then:

lim_{m→∞} P[ √m · ( (1/m) Σ_{i=1}^{m} Wi − μ ) / σ ≤ z ] = φ(z)

where φ is the distribution function of N(0, 1). That is, (W̄ − μ) / (σ/√m) → N(0, 1) in distribution.

Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the minority instances. Now, let us fix some j ∈ {1, . . . , d}, and let us assume that the j-th column follows a numerical random variable Wj, with mean μj and standard deviation σj < ∞. Let us compute

σ̂_j = √( (1/m) Σ_{i=1}^{m} (w_j^(i))² − ( (1/m) Σ_{i=1}^{m} w_j^(i) )² ),

the biased estimator of the standard deviation. It can be proven that instances generated with w̄_j = w_j^(i) − (σ̂_j / √m) · r, r ∼ N(0, 1), have the same sample mean as the original ones, and their sample variance tends to the original one.

B4.1. Pros and cons

RWO, as it was originally described in [17], uses the sample variance instead of the unbiased sample variance. Therefore we only have the guarantee that E(σ̂_j²) → σ_j² as m → ∞. If we had picked

τ_j = (1/(m−1)) Σ_{i=1}^{m} ( w_j^(i) − (1/m) Σ_{l=1}^{m} w_j^(l) )²

instead, E(τ_j) = σ_j² would hold.

Another downside of the algorithm is its arbitrariness when it comes to non-numerical variables. The most obvious upside of this algorithm is its simplicity and its good practical results.

B5. PDFOS

Due to the complexity of this preprocessing technique, in this case we will structure the description of its working procedure into several subsections, providing the motivation, the background techniques, and finally the pros and cons.

B5.1. Motivation

Given the distribution function of a random variable X, namely F(x), if that function has an almost-everywhere derivative, then, almost everywhere, it holds that

f(x) = F′(x) = lim_{h→0} P(x − h < X ≤ x + h) / (2h)

Given a fixed h, which we will call the bandwidth henceforth, and a collection of random samples X1, . . . , Xn of X, namely x1, . . . , xn, an estimator for f could be the mean number of samples in ]x − h, x + h[ (let us call this number I_h(x1, . . . , xn)) divided by the length of the interval:

f̂(x) = I_h(x1, . . . , xn) / (2hn)

If we define ω(x) = 1/2 if |x| < 1 and 0 otherwise, and ω_h(x) = ω(x/h), then we could write f̂ as:

f̂(x) = (1/(nh)) Σ_{i=1}^{n} ω_h(x − x_i)

In the d-dimensional case, we define:

f̂(x) = (1/(nh^d)) Σ_{i=1}^{n} ω_h(x − x_i)

B5.2. Kernel methods

If we took ω = (1/2) · 1_{]−1,1[}, then f̂ would have jump discontinuities, and so would its derivatives. On the other hand, we could take any ω with ω ≥ 0, ∫ ω(x) dx = 1 and ω even, and that way we could have estimators with more desirable properties with respect to continuity and differentiability.

f̂ can be evaluated through its MISE (Mean Integrated Squared Error):

MISE(h) = E_{x1,...,xn} ∫ ( f̂(x) − f(x) )² dx

B5.3. The algorithm

PDFOS (Probability Distribution density Function estimation based Oversampling) was proposed in [18]. It uses multivariate Gaussian kernel methods. The probability density function of a d-dimensional Gaussian distribution with mean 0 and covariance matrix Σ is:

φ_Σ(x) = (1 / √((2π)^d · det(Σ))) · exp( −(1/2) x Σ^{−1} x^T )

Let S+ = {xi = (w1^(i), . . . , wd^(i))}_{i=1}^{m} be the minority instances. The unbiased covariance estimator is:

U = (1/(m−1)) Σ_{i=1}^{m} (xi − x̄)(xi − x̄)^T,  where x̄ = (1/m) Σ_{i=1}^{m} xi

We will use kernel functions φ_h(x) = φ_U(x/h), where h needs to be optimized to minimize the MISE. We will pursue minimizing the following cross-validation function:

M(h) = (1/(m² h^d)) Σ_{i=1}^{m} Σ_{j=1}^{m} φ*_h(xi − xj) + (2/(m h^d)) φ_h(0)

where φ*_h ≈ φ_{h√2} − 2φ_h.

Once a proper h has been found, a suitable generating scheme could be to take xi + hrR, where xi ∈ S+, r ∼ N_d(0, 1) and U = R · R^T, R being upper-triangular. In case we have enough guarantees to decompose U = R · R^T (U must be a positive-definite matrix), we can use the Choleski decomposition. In fact, we provide a sketch of a proof showing that all covariance matrices are positive-semidefinite:

y^T ( Σ_{i=1}^{m} (xi − x̄)(xi − x̄)^T ) y = Σ_{i=1}^{m} ( (xi − x̄)^T y )^T ( (xi − x̄)^T y ) = Σ_{i=1}^{m} ||zi||² ≥ 0,  with zi = (xi − x̄)^T y,

for arbitrary y ∈ R^d. Hence, we need a strictly positive-definite matrix; otherwise PDFOS would not provide a result and will stop its execution.
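The generation scheme just described can be written down directly in R. The following is a standalone illustration of the formula x_i + h·r·R (not the package's internal implementation), using R's chol(), which returns an upper-triangular factor Rc with t(Rc) %*% Rc = U, and Silverman's rule (Section B5.4) as the bandwidth:

# Standalone illustration of the PDFOS generation scheme x_i + h * r * R;
# this is not the package's internal implementation. chol() returns the
# upper-triangular factor Rc such that t(Rc) %*% Rc equals U, so the
# generated points have covariance proportional to U.
pdfosSketch <- function(minority, numInstances) {
  minority <- as.matrix(minority)
  m <- nrow(minority)
  d <- ncol(minority)

  U  <- cov(minority)                       # unbiased covariance estimator
  Rc <- chol(U)                             # requires U positive-definite
  h  <- (4 / (m * (d + 2)))^(1 / (d + 4))   # Silverman's rule (Section B5.4)

  synthetic <- t(replicate(numInstances, {
    x <- minority[sample(m, 1), ]           # pick a random minority instance
    r <- rnorm(d)                           # r ~ N_d(0, 1)
    x + h * as.vector(r %*% Rc)             # x + h * r * R
  }))
  as.data.frame(synthetic)
}

# Hypothetical usage on the numeric attributes of a minority class,
# assuming the class column is named "Class" with label "positive":
# minority  <- subset(newthyroid1, Class == "positive", select = -Class)
# newPoints <- pdfosSketch(minority, 80)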


B5.4. Search of the optimal bandwidth

We take a first approximation to h as the value

h_Silverman = ( 4 / (m(d + 2)) )^{1/(d+4)}

where d is the number of attributes and m the size of the minority class.

Reshaping the equation of the cross-validation function and differentiating:

M(h) = (2/(m² h^d)) Σ_{j>i} φ*_h(xi − xj) + (1/(m h^d)) φ_{h√2}(0)

M′(h) = −(d h^{−1} / (m h^d)) φ_{h√2}(0) − (2/(m² h^d)) Σ_{j>i} φ*_h(xi − xj) d h^{−1} + (2/(m² h^d)) Σ_{j>i} φ*_h(xi − xj) h^{−3} (xi − xj)^T U (xi − xj)

And we use a straightforward gradient descent algorithm to find a good estimation of h.

B5.5. Pros and cons

On the one hand, PDFOS makes the assumption that the data can be locally approximated by a normal distribution. What is more, it makes the assumption that the same bandwidth gives good local results at every single point. Another disadvantage of the algorithm is that not every covariance matrix has a Choleski decomposition (a covariance matrix can be shown to be positive-semidefinite, whereas for the Choleski decomposition to exist it needs to be a positive-definite matrix).

On the other hand, although it makes some hypotheses, they are mild assumptions compared to the results it yields. It also has an enormous theoretical component, which ensures quality results.

B6. NEATER

Once we have created synthetic examples, we should ask ourselves how many of those instances are in fact relevant to our problem. Filtering algorithms can be applied to oversampled datasets to erase the least relevant instances.

B6.1. Game theory

Let (P, T, f) be our game space. We have a set of players P = {1, . . . , n}, and Ti = {1, . . . , ki}, the set of feasible strategies for the ith player, resulting in T = T1 × ··· × Tn. We can easily assign a payoff to each player taking into account his/her own strategy as well as the other players' strategies. So f is given by the following equation:

f : T → R^n,  t ↦ (f1(t), . . . , fn(t))

t_{−i} will denote (t1, . . . , t_{i−1}, t_{i+1}, . . . , tn) and, similarly, we can denote fi(ti, t_{−i}) = fi(t).

A strategic Nash equilibrium is a tuple (t1*, . . . , tn*) where fi(ti*, t*_{−i}) ≥ fi(ti, t*_{−i}) for every other t ∈ T and all i = 1, . . . , n. That is, a strategic Nash equilibrium maximizes the payoff for all the players.

The strategy for each player will be picked with respect to a given probability:

δi ∈ Δi = { (δi^(1), . . . , δi^(ki)) ∈ (R+_0)^{ki} : Σ_{j=1}^{ki} δi^(j) = 1 }

We define Δ := Δ1 × ··· × Δn and we call an element δ = (δ1, . . . , δn) ∈ Δ a strategy profile. Having fixed a strategy profile δ, the overall payoff for the ith player is defined as:

ui(δ) = Σ_{t = (t1,...,tn) ∈ T} δi^(ti) fi(t)

Given ui, the payoff for a strategy profile δ ∈ Δ for the i-th player, we will denote

δ_{−i} := (δ1, . . . , δ_{i−1}, δ_{i+1}, . . . , δn)   (B.1)

ui(δi, δ_{−i}) := ui(δ)   (B.2)

A probabilistic Nash equilibrium is a strategy profile δ* = (δ1*, . . . , δn*) verifying ui(δi*, δ*_{−i}) ≥ ui(δi, δ*_{−i}) for every other δ ∈ Δ and all i = 1, . . . , n.

A theorem ensures that every game space (P, T, f) with finite players and strategies has a probabilistic Nash equilibrium.

B6.2. The algorithm

NEATER (filteriNg of ovErsampled dAta using non cooperaTive gamE theoRy), introduced in [19], is a filtering algorithm based on game theory. Let S be the original training set and E the synthetic generated instances. Our players are S ∪ E. Every player is able to pick between two different strategies: being a negative instance (0) or being a positive instance (1). Players of S have a fixed strategy, where the ith player has δi = (0, 1) (a 0 strategy) in case it is a negative instance, or δi = (1, 0) (a 1 strategy) otherwise.

The payoff for a given instance is affected only by its own strategy and its k nearest neighbors in S ∪ E. That is, for every xi ∈ E we will have ui(δ) = Σ_{xj ∈ NN^k(xi)} δi^T w_{ij} δj, where w_{ij} = g(d(xi, xj)) and g is a decreasing function (greater distances imply a lower payoff). In our implementation, we have considered g(z) = 1/(1 + z²), with d the Euclidean distance.

Each step involves an update of the strategy profiles of the instances of E. Namely, if xi ∈ E, the following equations are used:

δi(0) = (1/2, 1/2),
δ_{i,1}(n + 1) = [ (α + ui((1, 0))) / (α + ui(δ(n))) ] · δ_{i,1}(n),
δ_{i,2}(n + 1) = 1 − δ_{i,1}(n + 1)

That is, we reinforce the strategy that is producing the higher payoff, to the detriment of the opposite strategy. This method has enough convergence guarantees.

References

[1] A. Fernández, S. del Río, N.V. Chawla, F. Herrera, An insight into imbalanced big data classification: outcomes and challenges, Complex Intell. Syst. 3 (2) (2017) 105–120, doi:10.1007/s40747-017-0037-9.
[2] Q. Zou, S. Xie, Z. Lin, M. Wu, Y. Ju, Finding the best classification threshold in imbalanced classification, Big Data Res. 5 (2016) 2–8, doi:10.1016/j.bdr.2015.12.001.
[3] B. Krawczyk, M. Woźniak, G. Schaefer, Cost-sensitive decision tree ensembles for effective imbalanced classification, Appl. Soft Comput. 14 (2014) 554–562, doi:10.1016/j.asoc.2013.08.014.
[4] P. Zhou, X. Hu, P. Li, X. Wu, Online feature selection for high-dimensional class-imbalanced data, Knowl. Based Syst. 136 (2017) 187–199, doi:10.1016/j.knosys.2017.09.006.
[5] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284, doi:10.1109/tkde.2008.239.
[6] I. Triguero, S. González, J.M. Moyano, S. García, J. Alcalá-Fdez, J. Luengo, A. Fernández, M.J. del Jesús, L. Sánchez, F. Herrera, KEEL 3.0: an open source software for multi-stage analysis in data mining, Int. J. Comput. Intell. Syst. 10 (1) (2017) 1238, doi:10.2991/ijcis.10.1.82.
[7] G. Lemaitre, F. Nogueira, C.K. Aridas, Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning, J. Mach. Learn. Res. 18 (2017) 1–5.

[8] A.D. Pozzolo, O. Caelen, S. Waterschoot, G. Bontempi, Racing for unbalanced methods selection, in: H. Yin, K. Tang, Y. Gao, F. Klawonn, M. Lee, T. Weise, B. Li, X. Yao (Eds.), Lecture Notes in Computer Science, 8206, Springer, 2013, pp. 24–31.
[9] W. Siriseriwan, Smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE, 2018, https://cran.r-project.org/package=smotefamily.
[10] N. Lunardon, G. Menardi, N. Torelli, ROSE: a package for binary imbalanced learning, R J. (2014) 82–92.
[11] G. Menardi, N. Torelli, Training and assessing classification rules with imbalanced data, Data Min. Knowl. Discov. 28 (2014) 92–122, doi:10.1007/s10618-012-0295-5.
[12] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357, doi:10.1613/jair.953.
[13] A. Fernandez, S. Garcia, F. Herrera, N.V. Chawla, SMOTE for learning from imbalanced data: progress and challenges. Marking the 15-year anniversary, J. Artif. Intell. Res. 61 (2018) 863–905.
[14] S. Das, S. Datta, B.B. Chaudhuri, Handling data irregularities in classification: foundations, trends, and future challenges, Pattern Recognit. 81 (2018) 674–693, doi:10.1016/j.patcog.2018.03.008.
[15] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 405–425, doi:10.1109/tkde.2012.232.
[16] B. Das, N.C. Krishnan, D.J. Cook, RACOG and wRACOG: two probabilistic oversampling techniques, IEEE Trans. Knowl. Data Eng. 27 (1) (2015) 222–234, doi:10.1109/tkde.2014.2324567.
[17] H. Zhang, M. Li, RWO-sampling: a random walk over-sampling approach to imbalanced data classification, Inf. Fusion 20 (2014) 99–116, doi:10.1016/j.inffus.2013.12.003.
[18] M. Gao, X. Hong, S. Chen, C.J. Harris, E. Khalaf, PDFOS: PDF estimation based over-sampling for imbalanced two-class problems, Neurocomputing 138 (2014) 248–259, doi:10.1016/j.neucom.2014.02.006.
[19] B.A. Almogahed, I.A. Kakadiaris, NEATER: filtering of over-sampled data using non-cooperative game theory, Soft Comput. 19 (11) (2014) 3301–3322, doi:10.1109/ICPR.2014.245.
[20] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Log. Soft Comput. 17 (2010) 255–287.
