Imbalance: Oversampling algorithms for imbalanced classification in R

I. Cordón et al.

Knowledge-Based Systems

Article history: Received 1 March 2018; Revised 14 June 2018; Accepted 25 July 2018; Available online xxx.

Keywords: Oversampling; Imbalanced classification; Machine learning; Preprocessing; SMOTE

Abstract: Addressing imbalanced datasets in classification tasks is a relevant topic in research studies. The main reason is that for standard classification algorithms, the success rate when identifying minority class instances may be adversely affected. Among different solutions to cope with this problem, data level techniques have shown a robust behavior. In this paper, the novel imbalance package is introduced. Written in R and C++, and available at the CRAN repository, this library includes recent relevant oversampling algorithms to improve the quality of data in imbalanced datasets, prior to performing a learning task. The main features of the package, as well as some illustrative examples of its use, are detailed throughout this manuscript.

https://doi.org/10.1016/j.knosys.2018.07.035
© 2018 Elsevier B.V. All rights reserved.
Table 1
Comparison of the proposed imbalance package to the available R packages for imbalanced classification.
Table 2
Code metadata.
Since the original SMOTE algorithm in 2002 [12], many different approaches have been designed to improve the classification performance under different scenarios [13,14]. In this newly developed software package, we have compiled some of the newest oversampling algorithms, which are listed below:

• mwmote. The Majority Weighted Minority Oversampling Technique (MWMOTE), first proposed in [15], is an extension of the original SMOTE algorithm [12,13]. It assigns higher weight to borderline instances, undersized minority clusters and examples near the borderline of the two classes.
• racog, wracog. Rapidly Converging Gibbs (RACOG) and wrapper-based RACOG (wRACOG), both proposed by [16], work for discrete attributes. They generate new examples with respect to an approximated distribution using a Gibbs Sampler scheme. RACOG needs the number of instances to generate beforehand. wRACOG requires a target classifier to show no improvement to stop generating examples.
• rwo. Random Walk Oversampling (RWO) is an algorithm introduced by [17], which generates synthetic instances so that the mean and deviation of numerical attributes remain close to the original ones.
• pdfos. Probability Distribution density Function estimation based Oversampling (PDFOS) was proposed in [18]. It uses multivariate Gaussian kernel methods to locally approximate the minority class.

Apart from those oversampling methods, we provide a filtering method called neater. The filteriNg of ovErsampled dAta using non cooperaTive gamE theoRy (NEATER), introduced in [19], is highly based on game theory. It discards the instances with higher probability of belonging to the opposite class, based on each instance's neighborhood.

The package also includes the method oversample, which is a wrapper that eases calls to the described and already existing methods.

To evaluate the oversampling process, we propose a visual method, called plotComparison. It plots a pairwise comparative grid of a selected set of attributes, both in the original dataset and the oversampled one. That way, if a proper oversampling has been performed, we expect to see larger minority clusters in the resulting dataset.

In addition to this, imbalance includes some datasets from the KEEL [6,20] repository (http://www.keel.es/datasets.php), which can be used to perform experiments. Additional datasets can be easily imported under a single constraint: they must contain a class column (not necessarily the last one) having two different values.

To conclude this section, we show in Table 1 a comparison of the main features of this novel imbalance package with respect to the previous software solutions available at CRAN: unbalanced [8], smotefamily [9], and rose [10,11]. Among the properties that are contrasted, we included the latest date of release, the number of preprocessing techniques that are available, and whether each package includes different approaches, namely undersampling, oversampling, SMOTE (and variants/extensions), advanced oversampling (techniques beyond SMOTE), and filtering methods. We also show whether an automatic wrapper procedure is available, and if it includes a visualization of the preprocessed output.

From Table 1 we may conclude that imbalance is a complete solution with many relevant oversampling techniques. It is by far the one with the largest number of approaches, and the only one that includes oversampling approaches beyond the traditional SMOTE scheme. Finally, we must stress the relevance of the visualization feature, which can be very useful for practitioners to check the areas where the minority class is mainly reinforced (Table 2).

3. Examples of use

The following example loads the dataset newthyroid1, included in our package, and applies the algorithm PDFOS to the dataset, requesting 80 new instances. newthyroid1 is a classical dataset that has a series of 5 different medical measurements as attributes, and classifies every patient as hyperthyroidism (42 instances) or non-hyperthyroidism (173 instances). Once the algorithm has been applied, we plot a pairwise visual comparison between the first three attributes of the original and modified datasets. Results can be observed in Fig. 1 and, after applying filtering, in Fig. 2.
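A minimal R sketch of this example follows. The dataset newthyroid1 and the functions pdfos and plotComparison are the ones described above, but the argument names used here (numInstances, attrs, classAttr) are assumptions and should be checked against the package documentation before running the code.

library(imbalance)

data(newthyroid1)

# Generate 80 new synthetic minority instances with PDFOS
# (the class column is assumed to be named "Class").
newSamples <- pdfos(newthyroid1, numInstances = 80, classAttr = "Class")

# Append the synthetic instances to the original dataset.
newDataset <- rbind(newthyroid1, newSamples)

# Pairwise comparison of the first three attributes,
# original versus oversampled dataset.
plotComparison(newthyroid1, newDataset,
               attrs = names(newthyroid1)[1:3], classAttr = "Class")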
The banana dataset, which has been included in the package, is a binary dataset with 5300 samples and two attributes (apart from the class attribute), artificially developed to represent a banana when plotted in two dimensions, with each class filled with a different color. A straightforward Random Undersampling has yielded an imbalanced dataset (with a 10% imbalance ratio) from the original dataset, both observable in Figs. 3 and 4, respectively. The original dataset has been included as banana-orig in the package; the imbalanced one, as banana. We provide a visual comparison between the results of applying oversampling to reach a 50% imbalance ratio in banana, and a later filtering using NEATER (see the code sketch after Algorithm 3 below). The applied oversampling techniques are SMOTE (Fig. 5), MWMOTE (Fig. 6), RWO (Fig. 7) and PDFOS (Fig. 8).

Algorithm 1 MWMOTE oversampling.
Require: S+ = {x1, . . . , xm}, minority instances
Require: S− = {y1, . . . , ym′}, majority instances
Require: T, requested number of synthetic examples
Require: k1, KNN parameter to filter noisy instances of S+
Require: k2, KNN parameter to compute boundary U ⊆ S−
Require: k3, KNN parameter to compute boundary V ⊆ S+
Require: α, tolerance for the closeness level to the borderline
Require: C, weight of the closeness factor to the borderline
Require: Cclust, parameter for the clustering threshold Tclust
1: Initialize S′ = ∅
2: For each x ∈ S+, compute its k1 KNN neighbourhood, NN^k1(x)
3: Let S+f = S+ − {x ∈ S+ : NN^k1(x) ∩ S+ = ∅}
4: Compute U = ∪_{x ∈ S+f} NN−^k2(x)
5: Compute V = ∪_{x ∈ U} NN+^k3(x)
6: For each x ∈ V, compute P(x) = Σ_{y ∈ U} Iα,C(x, y)
7: Normalize P(x) for each x ∈ V, P(x) = P(x) / Σ_{z ∈ V} P(z)
8: Compute Tclust = Cclust · (1/|S+f|) Σ_{x ∈ S+f} min_{y ∈ S+f, y ≠ x} d(x, y)
9: Let L1, . . . , LM ⊆ S+ be the clusters for S+, with Tclust as threshold
10:
11: for t = 1, . . . , T do
12: Pick x ∈ V with respect to P(x)
13: Uniformly pick y ∈ Le, where Le is the cluster with x ∈ Le
14: Uniformly pick r ∈ [0, 1]
15: S′ = S′ ∪ {x + r(y − x)}
16: end for
17:
18: return S′, synthetic examples

Algorithm 2 RACOG oversampling.
Require: S = {x1, . . . , xm}, positive examples
Require: β, burn-in
Require: α, lag
Require: T, requested number of synthetic examples
1: P = approximation of the S distribution
2: S′ = ∅
3: M = (T/m) · α + β
4:
5: for t = 1, . . . , M do
6: S̄ = GibbsSampler(S, P)
7: if t > β and t mod α = 0 then
8: S′ = S′ ∪ S̄
9: end if
10: end for
11:
12: S′ = Pick T random instances from S′
13: return S′, synthetic examples

Algorithm 3 wRACOG oversampling.
Require: Strain = {zi = (xi, yi) : i = 1, . . . , m}, train instances
Require: Sval, validation instances
Require: wrapper, a binary classifier
Require: T, requested number of synthetic examples
Require: α, tolerance parameter
1: S = Strain
2: P = approximation of the S distribution
3: Build a model with wrapper and Strain
4: Initialize S′ = ∅
5: Initialize τ = (τ1, . . . , τT) = (+∞, . . . , +∞)
6:
7: while the standard deviation of τ ≥ α do
8: S̄ = GibbsSampler(S, P)
9: Let Smisc ⊆ S̄ be the examples misclassified by model
10: Update S′ = S′ ∪ Smisc
11: Update the train set Strain = Strain ∪ Smisc
12: Build a new model with wrapper and Strain
13: Let s = sensitivity of model over Sval
14: Let τ = (τ2, . . . , τT, s)
15: end while
16:
17: return S′, synthetic examples
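The following R sketch illustrates the banana pipeline. The functions mwmote, neater, oversample and plotComparison are the ones introduced in Section 2; the argument names (numInstances, iterations, ratio, method) and the number of generated instances are illustrative assumptions, not values taken from the experiments above.

library(imbalance)

data(banana)

# Generate synthetic minority instances (the amount is illustrative only),
# then filter them with NEATER before appending them to banana.
newSamples  <- mwmote(banana, numInstances = 400)
keptSamples <- neater(banana, newSamples, iterations = 100)
newBanana   <- rbind(banana, keptSamples)

# The oversample wrapper can instead target an imbalance ratio directly,
# e.g. oversample(banana, ratio = 0.5, method = "MWMOTE").

# Visual comparison of both attributes before and after preprocessing.
plotComparison(banana, newBanana, attrs = names(banana)[1:2])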
4. Conclusions

Class imbalance in datasets is one of the most decisive factors to take into account when performing classification tasks. To improve the behavior of classification algorithms in imbalanced domains, one of the most common and effective approaches is to apply oversampling as a preprocessing step. It works by creating synthetic minority instances to increase the number of representatives belonging to that class.

In this paper we have presented the imbalance package for R. It was intended to alleviate some drawbacks that arise in current software solutions: firstly, to provide useful implementations of those novel oversampling methods that were not yet available for researchers and practitioners; secondly, to include a visualization environment for the sake of observing those areas of the minority class that are actually reinforced by means of preprocessing; and finally, to enable a simpler integration of these methods with the existing oversampling packages at CRAN.

As future work, we propose to keep maintaining and adding functionality to our new imbalance package. Specifically, we plan to include those new oversampling techniques that are regularly proposed in the specialized literature. In this sense, we consider that there are good prospects to improve the software in the near future.

Acknowledgments

This work is supported by the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.
Algorithm 4 RWO oversampling (continued).
18: end if
19: end for
20: S = S ∪ {(w1, . . . , wd)}
21: end for
22: end for
23:
24: S = Choose T random instances from S
25: return S, synthetic positive instances

Algorithm 6 NEATER filtering (continued).
10: Assign α = (α + u)/(α + un+1)
11: Update δn+1 = (α, 1 − α)
12: end for
13: end for
14:
15: for i = 1, . . . , m do
16: if δi1 > 0.5 then
17: E = E ∪ {(x̄i, 1)}
18: end if
19: end for
20:
21: return E ⊆ S, filtered synthetic positive instances

Algorithm 5 PDFOS oversampling.
Require: S = {xi = (w1(i), . . . , wd(i)) : i = 1, . . . , m}, positive instances
Require: T, required number of instances
1: Initialize S′ = ∅
2: Search for the h which minimizes M(h)
3: Find U, the unbiased covariance matrix of S
4: Compute U = R · R^T with the Choleski decomposition
5:
6: for i = 1, . . . , T do
7: Choose x ∈ S
8: Pick r with respect to a normal distribution, i.e. r ∼ Nd(0, 1)
9: S′ = S′ ∪ {x + hrR}
10: end for
11:
12: return S′, synthetic positive instances

Appendix A. Installation

The imbalance package is published in the CRAN repository, so it can be installed and loaded from any R session in the usual way.

Appendix B

Section B.1 briefly presents the classification task. Then, Sections B.2–B.6 include the main characteristics of the algorithms, together with an explanation of the upsides and downsides of every oversampling approach.

B1. Classification task

The classification problem is one of the best known problems in the machine learning framework. We are given a set of training instances, namely, S = {(x1, y1), . . . , (xm, ym)}, where xi ∈ X ⊂ R^n and yi ∈ Y, with Y = {0, 1}. X and Y will be called the domain and label set, respectively. The training set will be considered as independent and identically distributed (i.i.d.) samples taken with respect to an unknown probability P over the set X × Y, denoting that as S ∼ P^m.
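As a concrete illustration of this setting with one of the bundled datasets, the short snippet below inspects newthyroid1 as such a labeled sample S; the class column name "Class" and the helper imbalanceRatio are assumptions about the package interface, while the rest is base R.

library(imbalance)

data(newthyroid1)

# The data frame plays the role of S: five numeric attributes (X, a subset
# of R^5) plus a binary class column (the label set Y).
str(newthyroid1)

# Sizes of the positive (minority) and negative (majority) classes,
# i.e. |S+| = 42 and |S-| = 173 for this dataset.
table(newthyroid1$Class)

# Imbalance ratio |S+| / |S-| (imbalanceRatio is assumed to be exported).
imbalanceRatio(newthyroid1, classAttr = "Class")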
The goal of the classification problem is to find the classifier that minimizes the labeling error LP with respect to the true distribution of the data, defined as follows:

LP(hS) := E_{(x,y)∼P} [ 1[hS(x) ≠ y] ]

where 1[Condition] returns 1 if Condition holds, and 0 otherwise. Hence, LP represents an average of the error over the domain instances, weighted by the probability of extracting those instances.

Since we do not know the true distribution of the data, we will usually approximate LP(hS) using the average error over a test set T = {(x̄1, ȳ1), . . . , (x̄k, ȳk)}:

L(hS) = (1/k) Σ_{i=1}^{k} 1[hS(x̄i) ≠ ȳi]

Specifically, the imbalance package provides oversampling algorithms. This family of procedures aims to generate a set E of synthetic positive instances based on the training ones, so that we have a new classification problem with S̄+ = S+ ∪ E, S̄− = S− and S̄ = S̄+ ∪ S̄− as our new training set.

B2. MWMOTE

This algorithm, proposed by [15], is one of the many modifications of SMOTE [12], which is a classic algorithm to treat class imbalance. SMOTE generates new examples by filling empty areas among the positive instances. It updates the training set iteratively, by performing:

E := E ∪ {x + r · (y − x)}, x, y ∈ S+, r ∼ U(0, 1)

But SMOTE has a clear downside: it does not detect noisy instances. Therefore, it can generate synthetic examples out of noisy ones, or even between two minority clusters, which, if not cleansed up, may end up becoming noise inside a majority class cluster. MWMOTE (Majority Weighted Minority Oversampling Technique) tries to overcome both problems. It intends to give higher weight to borderline instances, undersized minority clusters and examples near the borderline of the two classes.

Let us introduce some notations and definitions:

• d(x, y) stands for the Euclidean distance between x and y.
• NN^k(x) ⊆ S will be the k-neighbourhood of x in S (the k closest instances with Euclidean distance).
• NN_i^k(x) ⊆ S_i, i = +, −, will be x's k-minority (resp. majority) neighbourhood.
• C_f(x, y) = (C/α) · f(d / d(x, y)) measures the closeness of x to y, that is, it will measure the proximity of borderline instances, where f(x) = x · 1[x ≤ α] + C · 1[x > α].
• D_f(x, y) = C_f(x, y) / Σ_{z ∈ V} C_f(z, y) will represent a density factor, such that an instance belonging to a compact cluster will have higher C_f(z, y) than another one belonging to a more sparse cluster.
• I_{α,C}(x, y) = C_f(x, y) · D_f(x, y), where I_{α,C}(x, y) = 0 if x ∉ NN_+^k3(y).

Let Tclust := Cclust · (1/|S+f|) Σ_{x ∈ S+f} min_{y ∈ S+f, y ≠ x} d(x, y). We will also use a mean-average agglomerative hierarchical clustering of the minority instances with threshold Tclust, that is, we will use a mean distance:

dist(Li, Lj) = (1/(|Li| |Lj|)) Σ_{x ∈ Li} Σ_{y ∈ Lj} d(x, y)

and, having started with a cluster per instance, we will proceed by joining the nearest clusters while the minimum of the distances remains lower than Tclust.

A few interesting considerations:

• A low k2 is required in order to ensure we do not pick too many negative instances in U.
• For the opposite reason, a high k3 must be selected to ensure we pick as many positive hard-to-learn borderline examples as we can.
• The higher the Cclust parameter, the fewer and more populated clusters we will get.

B2.1. Pros and cons
The most evident gain of this algorithm is that it fixes some of the weaknesses of SMOTE. And SMOTE is still one of the main references that researchers use as a benchmark to compare their algorithms. That makes MWMOTE a state-of-the-art algorithm. Apart from that, and although the pseudocode can be quite confusing, the idea behind the algorithm is easy to understand.
On the other hand, the algorithm relies on the idea that the space between two minority instances is going to belong to a minority cluster, which seems like a reasonable hypothesis, but can lead to error in certain datasets (e.g., the minority class spread across a large number of tiny clusters).

B3. RACOG and wRACOG

This set of algorithms, proposed in [16], assumes we want to approximate a discrete distribution P(W1, . . . , Wd).
The key of the algorithm is to approximate P(W1, . . . , Wd) as ∏_{i=1}^{d} P(Wi | W_n(i)), where n(i) ∈ {1, . . . , d}. Chow–Liu's algorithm is used to meet that purpose. This algorithm minimizes the Kullback–Leibler distance between two distributions:

DKL(P ‖ Q) = Σ_i P(i) (log P(i) − log Q(i))

We recall the definition of the mutual information of two discrete random variables Wi, Wj:

I(Wi, Wj) = Σ_{w1 ∈ Wi} Σ_{w2 ∈ Wj} p(w1, w2) log( p(w1, w2) / (p(w1) p(w2)) )

Let S+ = {xi = (w1(i), . . . , wd(i)) : i = 1, . . . , m} be the unlabeled positive instances. To approximate the distribution, we do:

• Compute G = (E, V), Chow–Liu's dependence tree.
• If r is the root of the tree, define P(Wr | W_n(r)) := P(Wr).
• For each arc (u, v) ∈ E in the tree, set n(v) := u and compute P(Wv | W_n(v)).

After that, a Gibbs Sampling scheme is used to extract samples with respect to the approximated probability distribution, where a batch of new instances is obtained by performing:

• Given a minority sample xi = (w1(i), . . . , wd(i)),
• iteratively construct, for each attribute, w̄k(i) ∼ P(Wk | w̄1(i), . . . , w̄_{k−1}(i), w_{k+1}(i), . . . , wd(i)).
• Return S̄ = {x̄i = (w̄1(i), . . . , w̄d(i)) : i = 1, . . . , m}.

B3.1. RACOG
RACOG (Rapidly Converging Gibbs) builds a Markov chain for each of the m minority instances, ruling out the first β generated instances and selecting a batch of synthetic examples every α iterations. That allows losing the dependence on previous values.

B3.2. wRACOG
RACOG depends on α, β and the requested number of instances. wRACOG (wrapper-based RACOG) tries to overcome that problem. Let wrapper be a binary classifier.
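To make the quantities above concrete, the following short base-R sketch (an illustration of ours, not a function of the package) computes the pairwise mutual information I(Wi, Wj) for the discrete attributes of a data frame.

# Empirical mutual information between two discrete attributes.
mutualInformation <- function(x, y) {
  pxy <- table(x, y) / length(x)      # joint distribution p(w1, w2)
  px  <- rowSums(pxy)                 # marginal p(w1)
  py  <- colSums(pxy)                 # marginal p(w2)
  terms <- pxy * log(pxy / outer(px, py))
  sum(terms[pxy > 0])                 # skip zero-probability cells
}

# Pairwise mutual information matrix for a data frame with discrete
# (factor) columns.
miMatrix <- function(data) {
  d <- ncol(data)
  outer(seq_len(d), seq_len(d),
        Vectorize(function(i, j) mutualInformation(data[[i]], data[[j]])))
}

The Chow–Liu dependence tree used by RACOG and wRACOG is the maximum-weight spanning tree of this matrix.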
B3.3. Pros and cons
Clearly, RACOG and wRACOG have the advantage that they are highly based on statistical evidence/procedures, and they have guarantees to succeed in their goal. On the contrary, there exists a substantial downside of those algorithms: they only work on discrete variables, which makes them very restrictive with respect to the set of data they can be applied to.

B4. RWO

RWO (Random Walk Oversampling) is an algorithm introduced by [17], which generates synthetic instances so that the mean and deviation of numerical attributes remain as close as possible to the original ones.
This algorithm is motivated by the central limit theorem, which states that given a collection of independent and identically distributed random variables W1, . . . , Wm, with E(Wi) = μ and Var(Wi) = σ² < ∞, then:

lim_{m→∞} P[ (√m / σ) · ( (1/m) Σ_{i=1}^{m} Wi − μ ) ≤ z ] = φ(z)

where φ is the distribution function of N(0, 1). That is, (W̄ − μ) / (σ/√m) → N(0, 1) probability-wise, where W̄ denotes the sample mean.

Let S+ = {xi = (w1(i), . . . , wd(i)) : i = 1, . . . , m} be the minority instances. Now, let us fix some j ∈ {1, . . . , d}, and let us assume that the j-th column follows a numerical random variable Wj, with mean μj and standard deviation σj < ∞. Let us compute

σ′j = sqrt( (1/m) Σ_{i=1}^{m} (wj(i))² − ( (1/m) Σ_{i=1}^{m} wj(i) )² ),

the biased estimator for the standard deviation. It can be proven that instances generated with w̄j = wj(i) − (σ′j / √m) · r, with r ∼ N(0, 1), have the same sample mean as the original ones, and their sample variance tends to the original one.

B4.1. Pros and cons
RWO, as it was originally described in [17], uses the sample variance instead of the unbiased sample variance. Therefore we only have guarantees that E(σ′j²) → σj² as m → ∞. If we had picked τj = (1/(m−1)) Σ_{i=1}^{m} ( wj(i) − (1/m) Σ_{i=1}^{m} wj(i) )² instead, E(τj) = σj² would hold.
Another downside of the algorithm is its arbitrariness when it comes to non-numerical variables.
The most obvious upside of this algorithm is its simplicity and its good practical results.

B5. PDFOS

Due to the complexity of this preprocessing technique, in this case we will structure the description of its working procedure into several subsections, providing the motivation, background techniques, and finally pros and cons.

B5.1. Density estimation
Given samples x1, . . . , xn drawn from an unknown density function f, an estimator for f could be the mean number of samples in ]x − h, x + h[ (let us call this number Ih(x1, . . . , xn)) divided by the length of the interval:

f̂(x) = Ih(x1, . . . , xn) / (2hn)

If we define ω(x) = 1/2 if |x| < 1 and 0 otherwise, and ωh(x) = ω(x/h), then we could write f̂ as:

f̂(x) = (1/(nh)) Σ_{i=1}^{n} ωh(x − xi)

In the d-dimensional case, we define:

f̂(x) = (1/(nh^d)) Σ_{i=1}^{n} ωh(x − xi)

B5.2. Kernel methods
If we took ω = (1/2) · 1_{]−1,1[}, then f̂ would have jump discontinuities, and so would its derivatives. On the other hand, we could take an ω with ω ≥ 0, ∫_Ω ω(x) dx = 1 over a domain Ω ⊆ X, and ω even, and that way we could have estimators with more desirable properties with respect to continuity and differentiability.
f̂ can be evaluated through its MISE (Mean Integrated Squared Error):

MISE(h) = E[ ∫ (f̂(x) − f(x))² dx ]

B5.3. The algorithm
PDFOS (Probability Distribution density Function estimation based Oversampling) was proposed in [18]. It uses multivariate Gaussian kernel methods. The probability density function of a d-dimensional Gaussian distribution with mean 0 and Ψ as its covariance matrix is:

φ_Ψ(x) = (1 / sqrt((2π)^d · det(Ψ))) · exp( −(1/2) x Ψ^{−1} x^T )

Let S+ = {xi = (w1(i), . . . , wd(i)) : i = 1, . . . , m} be the minority instances. The unbiased covariance estimator is:

U = (1/(m−1)) Σ_{i=1}^{m} (xi − x̄)(xi − x̄)^T, where x̄ = (1/m) Σ_{i=1}^{m} xi

We will use kernel functions φh(x) = φ_U(x/h), where h needs to be optimized to minimize the MISE. We will pursue minimizing the following cross-validation function:

M(h) = (1/(m² h^d)) Σ_{i=1}^{m} Σ_{j=1}^{m} φ*_h(xi − xj) + (2/(m h^d)) φh(0)

where φ*_h ≈ φ_{h√2} − 2φh.

Once a proper h has been found, a suitable generating scheme could be to take xi + hRr, where xi ∈ S+, r ∼ Nd(0, 1) and U = R · R^T, with R being upper-triangular. In case we have enough guarantees to decompose U = R · R^T (U must be a positive-definite matrix), we could use the Choleski decomposition. In fact, we provide a sketch of proof showing that all covariance matrices are positive-semidefinite: for every v ∈ R^d,

v^T U v = (1/(m−1)) Σ_{i=1}^{m} ( v^T (xi − x̄) )² ≥ 0
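As an illustration of this generating scheme (a sketch of ours, not the package's internal code), the following R snippet draws synthetic instances of the form xi + h·r·R for a given bandwidth h, using chol() so that the perturbation has covariance h²U.

# S is a numeric matrix whose rows are the minority instances;
# h would come from minimizing the cross-validation function M(h).
pdfosGenerate <- function(S, numInstances, h) {
  d <- ncol(S)
  U <- cov(S)                     # unbiased covariance estimator of S
  R <- chol(U)                    # upper-triangular R with U = t(R) %*% R
  synthetic <- matrix(NA_real_, nrow = numInstances, ncol = d)
  for (t in seq_len(numInstances)) {
    x <- S[sample(nrow(S), 1), ]  # pick a minority instance at random
    r <- rnorm(d)                 # r ~ N_d(0, 1)
    synthetic[t, ] <- x + h * drop(r %*% R)  # perturbation covariance h^2 * U
  }
  synthetic
}

# Toy usage with an arbitrary bandwidth (for illustration only).
set.seed(1)
minority  <- matrix(rnorm(40), ncol = 2)
newPoints <- pdfosGenerate(minority, numInstances = 10, h = 0.5)

Within the package, the complete procedure, including the search for h, is carried out by pdfos.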
References

[8] A.D. Pozzolo, O. Caelen, S. Waterschoot, G. Bontempi, Racing for unbalanced methods selection, in: H. Yin, K. Tang, Y. Gao, F. Klawonn, M. Lee, T. Weise, B. Li, X. Yao (Eds.), Lecture Notes in Computer Science, 8206, Springer, 2013, pp. 24–31.
[9] W. Siriseriwan, Smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE, 2018, https://cran.r-project.org/package=smotefamily.
[10] N. Lunardon, G. Menardi, N. Torelli, ROSE: a package for binary imbalanced learning, R J. (2014) 82–92.
[11] G. Menardi, N. Torelli, Training and assessing classification rules with imbalanced data, Data Min. Knowl. Discov. 28 (2014) 92–122, doi:10.1007/s10618-012-0295-5.
[12] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357, doi:10.1613/jair.953.
[13] A. Fernandez, S. Garcia, F. Herrera, N.V. Chawla, SMOTE for learning from imbalanced data: progress and challenges. Marking the 15-year anniversary, J. Artif. Intell. Res. 61 (2018) 863–905.
[14] S. Das, S. Datta, B.B. Chaudhuri, Handling data irregularities in classification: foundations, trends, and future challenges, Pattern Recognit. 81 (2018) 674–693, doi:10.1016/j.patcog.2018.03.008.
[15] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 405–425, doi:10.1109/tkde.2012.232.
[16] B. Das, N.C. Krishnan, D.J. Cook, RACOG and wRACOG: two probabilistic oversampling techniques, IEEE Trans. Knowl. Data Eng. 27 (1) (2015) 222–234, doi:10.1109/tkde.2014.2324567.
[17] H. Zhang, M. Li, RWO-sampling: a random walk over-sampling approach to imbalanced data classification, Inf. Fusion 20 (2014) 99–116, doi:10.1016/j.inffus.2013.12.003.
[18] M. Gao, X. Hong, S. Chen, C.J. Harris, E. Khalaf, PDFOS: PDF estimation based over-sampling for imbalanced two-class problems, Neurocomputing 138 (2014) 248–259, doi:10.1016/j.neucom.2014.02.006.
[19] B.A. Almogahed, I.A. Kakadiaris, NEATER: filtering of over-sampled data using non-cooperative game theory, Soft Comput. 19 (11) (2014) 3301–3322, doi:10.1109/ICPR.2014.245.
[20] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Log. Soft Comput. 17 (2010) 255–287.