Balancing Data
Contents

Part I: Balancing Techniques

1 Introduction
2 Undersampling
  2.1 Random Undersampling
    2.1.1 Technique Introduction
    2.1.2 Pseudo-Code
    2.1.3 Complexity
    2.1.4 Library and Execution Example
  2.2 Near Miss Methods
    2.2.1 Pseudo-Code
    2.2.2 Complexity
    2.2.3 Library
  2.3 Tomek Links
    2.3.1 Technique Introduction and History
    2.3.2 Pseudo-Code
    2.3.3 Complexity
  2.4 KNN Order (K-Nearest Neighbor Order)
    2.4.1 Pseudo-Code
    2.4.2 Complexity
  2.5 CNN (Condensed Nearest Neighbor Method)
    2.5.1 Pseudo-Code
    2.5.2 Complexity
    2.5.3 Library and Execution Example
  2.6 Cluster Centroids
    2.6.1 Technique Introduction and History
    2.6.2 Pseudo-Code
    2.6.3 Library and Execution Example
3 Oversampling
  3.1 Random Oversampling
    3.1.1 Technique Introduction and History
    3.1.2 Pseudo-Code
    3.1.3 Complexity
    3.1.4 Library and Execution Example
  3.2 Synthetic Minority Oversampling
    3.2.1 Technique Introduction and History
    3.2.2 Pseudo-Code
    3.2.3 Complexity
    3.2.4 Library and Execution Example
  3.3 Adaptive Synthetic Sampling
    3.3.1 Technique Introduction and History
    3.3.2 Pseudo-Code
    3.3.3 Complexity
    3.3.4 Library and Execution Example
  3.4 Borderline Synthetic Minority Oversampling
    3.4.1 Technique Introduction and History
    3.4.2 Pseudo-Code
    3.4.3 Complexity
    3.4.4 Library and Execution Example
4 Combination of Techniques
  4.1 Combination of Oversampling and Undersampling
  4.2 Combination of K-Means and SMOTE
    4.2.1 Pseudo-Code
    4.2.2 Library and Execution Example
List of Figures

1 Fraud detection: imbalanced data
2 Execution example of RUS
3 Execution example of NearMiss
4 Tomek Links figure
5 Execution example of CNN
6 Cluster Centroids
7 Execution example of Cluster Centroids for undersampling
8 Random Oversampling Python execution
9 Synthetic Minority Oversampling Technique Python execution
10 Adaptive Synthetic Sampling Python execution
11 Borderline Synthetic Minority Oversampling
12 Algorithm explaining the combination of the K-means and SMOTE techniques
13 Implementation of the K-means SMOTE combination
14 ROC curve
15 Example of AUC = 100%
16 Example of AUC = 70%
17 Example of AUC = 50%
18 Example of AUC = 0%
Part I
Balancing Techniques
1 Introduction
Real-world datasets commonly have the number of samples of one class under-represented compared to the other classes. This imbalance gives rise to the “class imbalance” problem (Prati et al., 2009), also called the “curse of imbalanced datasets”: the problem of learning a concept from a class that has only a small number of samples. The class imbalance problem has been encountered in multiple areas such as telecommunication management, bioinformatics, fraud detection, and medical diagnosis, and has been considered one of the top 10 problems in data mining and pattern recognition (Yang and Wu, 2006; Rastgoo et al., 2016). Imbalanced data typically refers to classification problems, both binary and multi-class, in which the classes are not represented equally. In the multi-class setting, the category with more data samples is called the majority category, while the category with fewer data samples is called the minority category. In binary classification, the class with more data samples is called the negative class, and the class with fewer data samples is called the positive class.
2 Undersampling
Undersampling is one of several methods of dealing with imbalanced data. The idea of undersampling is to balance the data by filtering out some of the over-represented samples so that each class ends up with roughly the same number of examples. Undersampling is a quite fast method for imbalanced data, since many majority class examples are simply discarded. The downside of the method is that some useful information may lie within those discarded examples.
2.1 Random Undersampling
2.1.1 Technique introduction
Random undersampling randomly removes samples of the majority class until their number matches that of the minority class (He & Ma, 2013). The method is easy to implement and fast to execute, which is good for very large and complex datasets. A downside of removing samples randomly is that some of the more important samples could end up being removed. The method can be used for both binary and multi-class classification problems.
2.1.2 Pseudo-Code
Inputs:
• X_train: feature matrix of the training set
• y_train: corresponding labels of the training set
• ratio: ratio of majority class samples to keep (e.g., 1.0 means keep all, 0.5 means keep half)
Output:
• X_resampled: resampled feature matrix
• y_resampled: corresponding resampled labels
1. Count the number of samples in the majority class (N_majority).
2. Calculate the number of majority class samples to keep after undersampling: N_keep = round(ratio × N_majority).
3. Initialize an empty list to store the indices of the samples to keep.
4. For each unique class label c in y_train:
   (a) If c is the minority class, add all indices corresponding to the minority class samples to the list of indices to keep.
   (b) Else (c is the majority class), randomly select N_keep indices from the majority class samples and add them to the list of indices to keep.
5. Extract the corresponding feature matrix and labels using the selected indices.
6. Return the resampled feature matrix (X_resampled) and the corresponding labels (y_resampled).
7. Stop.
2.1.3 Complexity
Complexity of the procedure: $O(n + n_{min} + N_{keep})$, where $n$ is the total number of samples, $n_{min}$ is the number of minority class samples, and $N_{keep}$ is the number of majority class samples kept after undersampling.
2.1.4 Library and Execution example
To perform random undersampling in Python, we can use the RandomUnderSam-
pler class from the imbalanced-learn library.
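A minimal, illustrative sketch follows; the synthetic dataset and its roughly 90/10 class split are assumptions chosen for demonstration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced binary dataset (roughly a 90/10 split; values are illustrative)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# sampling_strategy=1.0 keeps one majority sample per minority sample
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
print("After:", Counter(y_resampled))
```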
2.2 Near Miss Methods
2.2.1 Pseudo-Code
The NearMiss methods select which majority class samples to keep based on distances between samples of the two classes. Depending on the variant:
• If using NearMiss-1:
– Compute the distance to the N closest samples in the majority class.
– Keep track of the N closest distances.
• If using NearMiss-2:
– Compute the distance to the N farthest samples in the majority class.
– Keep track of the N farthest distances.
• If using NearMiss-3:
– Keep the M nearest neighbors for each negative sample.
– For each negative sample, compute the average distance to its M nearest
neighbors.
– Select positive samples based on the average distance to the N nearest
neighbors.
Finally, select the majority class samples based on the computed distances so as to achieve class balance.
2.2.2 Complexity
Complexity Analysis
• Definitions:
  – $n_{min}$: number of minority class samples.
  – $n_{maj}$: number of majority class samples.
  – $N$: number of nearest/farthest neighbors to consider.
  – $M$: number of nearest neighbors for each majority class sample in NearMiss-3.
• NearMiss-1:
  – Compute the distance to all majority class samples: $O(n_{maj})$
  – Sort distances: $O(n_{maj} \log n_{maj})$
  – Select the $N$ closest samples: $O(N)$
  – Total for one minority sample: $O(n_{maj} + n_{maj} \log n_{maj} + N) \approx O(n_{maj} \log n_{maj})$
  – Total for all minority samples: $n_{min} \times O(n_{maj} \log n_{maj}) = O(n_{min}\, n_{maj} \log n_{maj})$
• NearMiss-2:
  – Compute the distance to all majority class samples: $O(n_{maj})$
  – Sort distances: $O(n_{maj} \log n_{maj})$
  – Select the $N$ farthest samples: $O(N)$
  – Total for one minority sample: $O(n_{maj} + n_{maj} \log n_{maj} + N) \approx O(n_{maj} \log n_{maj})$
  – Total for all minority samples: $n_{min} \times O(n_{maj} \log n_{maj}) = O(n_{min}\, n_{maj} \log n_{maj})$
• NearMiss-3:
  – For each majority class sample, compute the distance to all other majority class samples: $O(n_{maj}^2)$
  – Sort distances and select the $M$ nearest neighbors: $O(n_{maj}^2 \log n_{maj})$
  – Compute the average distance to the $M$ nearest neighbors for each majority sample: $O(n_{maj} M)$
  – Sort majority samples by average distance: $O(n_{maj} \log n_{maj})$
  – Select the majority samples with the smallest average distance to the $N$ nearest neighbors: $O(N)$
  – Total for NearMiss-3: $O(n_{maj}^2 + n_{maj}^2 \log n_{maj} + n_{maj} M + n_{maj} \log n_{maj} + N) \approx O(n_{maj}^2 \log n_{maj})$
Summary of Complexities
• NearMiss-1 and NearMiss-2: $O(n_{min}\, n_{maj} \log n_{maj})$
• NearMiss-3: $O(n_{maj}^2 \log n_{maj})$
2.2.3 Library
To perform the NearMiss undersampling techniques in Python, we can use the NearMiss class from the imbalanced-learn library.
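A minimal usage sketch, reusing the same kind of toy imbalanced dataset as above:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# version can be 1, 2, or 3, matching the NearMiss-1/2/3 variants above
nm = NearMiss(version=1)
X_resampled, y_resampled = nm.fit_resample(X, y)
print(Counter(y_resampled))
```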
Figure 3: Execution example of Near-miss
2.3 Tomek Links
A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors; such pairs typically sit on the class boundary or are noise, so removing their majority class members sharpens the boundary.
2.3.2 Pseudo-Code
The pseudo-code for the Tomek Links undersampling algorithm can be outlined as follows:
Algorithm 1 Finding Tomek Links
1: Input: dataset $S$
2: Output: set of Tomek links
3: for $i = 1$ to $|S|$ do
4:   for $j = 1$ to $|S|$ do
5:     if $i \neq j$ and $y_i \neq y_j$ then
6:       Calculate the Euclidean distance $d_{ij}$ between $x_i$ and $x_j$
7:       $tomek \leftarrow$ true
8:       for all $k \in S$ do
9:         if $y_i = y_k$ or $y_j = y_k$ then
10:          if $d_{ik} < d_{ij}$ or $d_{jk} < d_{ij}$ then
11:            $tomek \leftarrow$ false
12:            Break inner loop
13:          end if
14:        end if
15:      end for
16:      if $tomek$ is true then
17:        Add $(x_i, x_j)$ to the set of Tomek links
18:      end if
19:    end if
20:  end for
21: end for
22: return set of Tomek links
This step helps in cleaning the dataset by eliminating ambiguous or noisy instances
near the class boundary.
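In practice, rather than implementing the double loop above, the TomekLinks class from the imbalanced-learn library performs this removal; a minimal sketch, assuming X and y are the toy data from the earlier examples:

```python
from imblearn.under_sampling import TomekLinks

# Detect Tomek links in (X, y) and drop the majority class member of each pair
tl = TomekLinks()
X_resampled, y_resampled = tl.fit_resample(X, y)
```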
2.3.3 Complexity
Let $N$ be the number of samples, $M$ the number of samples from the other class, and $K$ the number of Tomek links found.
Identifying Tomek links: this step has a time complexity ranging from $O(NM)$ for a brute-force nearest neighbor search down to roughly $O(N \log M)$ when an index structure such as a k-d tree is used.
Removing Tomek links: this step is proportional to the number of Tomek links found, which is $O(K)$ in the worst case.
2.4 KNN Order (K-Nearest Neighbor Order)
2.4.1 Pseudo-Code
Algorithm 2 KNN Order Algorithm
1: Input:
2: $S_{maj}$: the set of majority class examples, with $l = |S_{maj}|$.
3: $P$: the percentage of undersampling.
4: $k$: the number of neighbors.
5: Output:
6: $S_{maj} - R$ // the retained subset of the majority class
7: BEGIN
8: exToRemove ← matrix(nrow = $l \times k$, ncol = 2)
9: for $i = 1$ to $l$ do
10:   Find the $k$ nearest neighbors of the $i$-th element of $S_{maj}$. Save the indexes of the neighbors and the distances between them and the analyzed example in the subsequent rows of the exToRemove matrix;
11: end for
12: exToRemove ← exToRemove[order(exToRemove[, 2]), ] // sort by ascending distance
13: exToRemove ← exToRemove[!duplicated(exToRemove[, 1]), ] // remove repeated indexes
14: $Z = P \times l$ // the number of examples to be removed from $S_{maj}$
15: if nrow(exToRemove) ≥ $Z$ then
16:   $R$ ← exToRemove[1 : $Z$, 1]
17: else
18:   $R$ ← exToRemove[, 1]
19: end if
20: return $S_{maj} - R$
21: END
2.4.2 Complexity
Finding nearest neighbors:
• For each instance in the majority class (of size $l$), the algorithm finds its $k$ nearest neighbors.
• This operation is performed $l$ times.
• Writing the $k$ neighbors of a single instance into the result matrix costs $O(k)$; the neighbor search itself additionally depends on the search structure used (a brute-force search costs $O(l)$ per query).
Sorting:
• After collecting the nearest neighbors of all instances in the majority class, the algorithm sorts them by ascending distance.
• Sorting $l \times k$ neighbor–distance pairs has a complexity of $O(lk \log(lk))$.
Post-processing:
• The algorithm performs additional operations such as removing duplicate indexes and determining the number of examples to be removed ($Z$).
• These operations have linear complexity, $O(lk)$.
Overall complexity:
• Combining the main operations (finding nearest neighbors, sorting, and post-processing), the sorting step dominates, so the overall complexity is $O(lk \log(lk))$.
2.5 CNN (Condensed Nearest Neighbor Method)
The Condensed Nearest Neighbor (CNN) method is a data reduction technique used with the k-Nearest Neighbors (k-NN) classification algorithm. CNN reduces the data set by selecting a subset of prototypes from the training data that can classify examples almost as accurately as 1-NN does with the whole data set. The algorithm works iteratively, scanning all elements of the training set and looking for an element whose nearest prototype in the current prototype set has a different label than the element itself. Such an element is removed from the training set and added to the set of prototypes. This process is repeated until no more prototypes are added to the set. The examples that are not prototypes are called “absorbed” points.
2.5.1 Pseudo-Code
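A minimal NumPy sketch of the condensation loop described above; the function name and the random seed are illustrative, and X and y are assumed to be NumPy arrays:

```python
import numpy as np

def condensed_nearest_neighbor(X, y, random_state=0):
    """Grow a prototype set until every training example is classified
    correctly by its nearest prototype (1-NN); the rest are "absorbed"."""
    rng = np.random.default_rng(random_state)
    prototypes = [int(rng.integers(len(X)))]  # start from one random example
    changed = True
    while changed:  # repeat until a full pass adds no new prototype
        changed = False
        for i in range(len(X)):
            if i in prototypes:
                continue
            # 1-NN prediction of example i using the current prototypes
            dists = np.linalg.norm(X[prototypes] - X[i], axis=1)
            nearest = prototypes[int(np.argmin(dists))]
            if y[nearest] != y[i]:
                prototypes.append(i)  # misclassified: promote to prototype
                changed = True
    return X[prototypes], y[prototypes]
```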
2.5.2 Complexity
The complexity of the Condensed Nearest Neighbor (CNN) algorithm can be summarized as follows: each pass scans the training examples and answers one 1-NN query per example against the current prototype set, so a pass costs on the order of $O(n \cdot p)$ distance computations, where $n$ is the number of examples in the majority class and $p$ the current number of prototypes. When the prototype set stays small, the algorithm is often described as effectively linear in $n$ per pass, $O(n)$, although repeated passes and a growing prototype set increase the cost in the worst case.
Figure 5: Execution Example of CNN
2.6 Cluster Centroids
2.6.1 Technique Introduction and History
The Cluster Centroids method undersamples the majority class by clustering its samples, typically with k-means, and keeping one representative per cluster.
• The instance belonging to the majority class which is nearest to the cluster centroid in the feature space is considered to be the most important instance.
2.6.2 Pseudo-Code
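As a substitute sketch of the procedure, the ClusterCentroids class from the imbalanced-learn library performs this undersampling (X and y as in the earlier examples):

```python
from imblearn.under_sampling import ClusterCentroids

# Replace majority class samples with k-means cluster representatives;
# with voting="hard", the real samples nearest to the centroids are kept
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
```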
3 Oversampling
Oversampling is a technique used in machine learning and data analysis to address
the issue of imbalanced datasets, where one class significantly outnumbers the other.
This imbalance can lead to biased models that perform poorly on the minority class.
Oversampling involves increasing the number of instances in the minority class to
balance the dataset. This can be done through various methods, including random
oversampling, where instances are duplicated, and synthetic oversampling, where new
instances are generated. Oversampling is used in various fields to address the issue
of imbalanced datasets, including medical diagnosis for improving disease detection
models, fraud detection in financial services to enhance the model’s ability to identify
fraudulent transactions, social network analysis to better predict friendships, customer
churn prediction in customer relationship management to improve customer retention
strategies, and speech recognition to improve the model’s ability to recognize rare
words.
3.1 Random Oversampling
3.1.1 Technique Introduction and History
Random oversampling balances the dataset by duplicating randomly selected minority class examples until the desired class ratio is reached.
3.1.2 Pseudo-Code
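A minimal NumPy sketch of the duplication step described above; the function and argument names are illustrative and assume a binary problem with a strictly smaller minority class:

```python
import numpy as np

def random_oversample(X, y, minority_label, random_state=0):
    """Duplicate randomly chosen minority class examples until the
    minority class matches the majority class in size."""
    rng = np.random.default_rng(random_state)
    minority_idx = np.flatnonzero(y == minority_label)
    n_majority = len(y) - len(minority_idx)
    # Sample (with replacement) the indices of the duplicates to add
    extra = rng.choice(minority_idx, size=n_majority - len(minority_idx), replace=True)
    keep = np.concatenate([np.arange(len(X)), extra])
    return X[keep], y[keep]
```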
3.1.3 Complexity
The time complexity of random oversampling is primarily determined by the process of duplicating examples from the minority class. If we treat duplicating a single example as a constant time operation, the time complexity of oversampling a dataset with $n$ examples in the minority class is $O(n)$, because for each example in the minority class we perform a constant time operation to duplicate it.
3.2 Synthetic Minority Oversampling
3.2.1 Technique Introduction and History
SMOTE (Chawla et al., 2002) generates new minority class samples by interpolating between existing minority samples and their nearest minority class neighbors, rather than simply duplicating them.
3.2.2 Pseudo-Code
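A minimal sketch of SMOTE's core interpolation step, using scikit-learn only for the neighbor search; the function name and defaults are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic minority samples by interpolating between
    random minority samples and their k nearest minority neighbors."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)  # column 0 is each point itself
    synthetic = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = int(rng.integers(len(X_min)))          # a random minority sample
        j = neigh[i, int(rng.integers(1, k + 1))]  # one of its k neighbors
        gap = rng.random()                         # interpolation factor in [0, 1]
        synthetic[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```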
3.2.3 Complexity
The Synthetic Minority Oversampling Technique involves two main steps. The time complexity of finding nearest neighbors is typically $O(n \log n)$ for efficient nearest neighbor search structures such as k-d trees or ball trees, where $n$ is the number of instances in the dataset. Generating a synthetic instance is generally considered a constant time operation, $O(1)$, as it involves a fixed number of operations per instance. Therefore, the overall time complexity of SMOTE is dominated by the nearest neighbor search, $O(n \log n)$, plus $O(1)$ for each synthetic instance generated.
3.2.4 Library and Execution Example
To apply SMOTE in Python, we can use the SMOTE class from the imbalanced-learn library.
Figure 9: Synthetic Minority Oversampling Technique Python execution.
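A minimal usage sketch (X and y as in the earlier examples):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

sm = SMOTE(k_neighbors=5, random_state=0)  # 5 is the library's default k
X_resampled, y_resampled = sm.fit_resample(X, y)
print(Counter(y_resampled))
```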
3.3 Adaptive Synthetic Sampling
3.3.1 Technique Introduction and History
ADASYN (He et al., 2008) extends SMOTE by adaptively generating more synthetic samples for the minority class examples that are harder to learn, i.e., those with more majority class examples among their nearest neighbors.
3.3.2 Pseudo-Code
Training dataset $D_{tr}$ with $m$ samples $\{x_i, y_i\}$, $i = 1, \ldots, m$, where $x_i$ is an instance in the $n$-dimensional feature space $X$ and $y_i \in Y = \{1, -1\}$ is the class identity label associated with $x_i$. Define $m_s$ and $m_l$ as the number of minority class examples and the number of majority class examples, respectively. Therefore, $m_s \leq m_l$ and $m_s + m_l = m$.
Algorithm 7 Adaptive Synthetic Sampling Technique (ADASYN)
1: Calculate the degree of class imbalance: $d = m_s / m_l$, where $d \in (0, 1]$.
2: if $d < d_{th}$ ($d_{th}$ is a preset threshold for the maximum tolerated degree of class imbalance ratio) then
3:   Calculate the number of synthetic data examples that need to be generated for the minority class: $G = (m_l - m_s) \times \beta$, where $\beta \in [0, 1]$ is a parameter used to specify the desired balance level after generation of the synthetic data. $\beta = 1$ means a fully balanced dataset is created after the generation process.
4:   for each example $x_i$ in the minority class do
5:     Find the $K$ nearest neighbors of $x_i$ based on the Euclidean distance in the $n$-dimensional space.
6:     Calculate the ratio $r_i = \Delta_i / K$, where $\Delta_i$ is the number of examples among the $K$ nearest neighbors of $x_i$ that belong to the majority class.
7:   end for
8:   Normalize the ratios: $\hat{r}_i = r_i / \sum_i r_i$, so that $\sum_i \hat{r}_i = 1$.
9:   Calculate the number of synthetic examples to generate for each $x_i$: $g_i = \hat{r}_i \times G$.
10:  For each minority example $x_i$, generate $g_i$ synthetic examples by interpolation: randomly choose one minority neighbor $x_{zi}$ from the $K$ nearest neighbors of $x_i$ and set $s = x_i + \lambda\,(x_{zi} - x_i)$, with $\lambda$ drawn uniformly from $[0, 1]$.
11: end if
3.3.3 Complexity
To analyze the time complexity of the Adaptive Synthetic Sampling Technique (ADASYN) algorithm, we break down its main steps. Calculating the degree of class imbalance takes constant time, as do checking the threshold condition and calculating the total number of synthetic examples $G$. For each minority class example, finding the $K$ nearest neighbors by brute force and calculating the ratio $r_i$ require $O(m)$ time, where $m$ is the total number of examples, so the loop over all $m_s$ minority examples costs $O(m_s \cdot m)$. Normalizing the $r_i$ and calculating the per-example generation counts $g_i$ are linear in $m_s$. Generating the synthetic examples costs time proportional to the $G$ examples produced, and combining the feature matrix and labels takes $O(m)$ time. The overall time complexity is therefore dominated by the neighbor searches, $O(m_s \cdot m)$ (quadratic in $m$ in the worst case, rather than linear).
3.3.4 Library and Execution example
To implement the Adaptive Synthetic Sampling Technique in Python, we can utilize the ADASYN class from the imbalanced-learn library.
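A minimal usage sketch (X and y as in the earlier examples; n_neighbors corresponds to $K$ in Algorithm 7):

```python
from collections import Counter

from imblearn.over_sampling import ADASYN

ada = ADASYN(n_neighbors=5, random_state=0)
X_resampled, y_resampled = ada.fit_resample(X, y)
print(Counter(y_resampled))
```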
3.4 Borderline Synthetic Minority Oversampling
3.4.1 Technique Introduction and History
Borderline-SMOTE (Han et al., 2005) restricts SMOTE's synthetic generation to the minority class examples that lie near the class boundary, where misclassification is most likely.
3.4.2 Pseudo-Code
3.4.3 Complexity
To analyze the complexity of the Borderline-SMOTE algorithm, we consider its main steps. First, identifying borderline instances involves calculating distances to the $m$ nearest neighbors of each minority class example, resulting in $O((p_{num} + n_{num}) \cdot d \cdot \log(p_{num} + n_{num}))$ complexity, where $p_{num}$ and $n_{num}$ are the numbers of minority and majority class examples and $d$ is the dimensionality of the feature space. Second, marking borderline instances has a time complexity of $O(p_{num})$. Finally, generating synthetic examples entails calculating distances to the $k$ nearest neighbors of each borderline instance and creating $s$ synthetic examples, leading to $O(p_{num} \cdot k \cdot s \cdot d)$ complexity. Thus, the overall complexity is $O((p_{num} + n_{num}) \cdot d \cdot \log(p_{num} + n_{num}) + p_{num} + p_{num} \cdot k \cdot s \cdot d)$.
3.4.4 Library and Execution Example
To implement Borderline-SMOTE in Python, we can use the BorderlineSMOTE class from the imbalanced-learn library. This involves:
1. Importing the BorderlineSMOTE class from the imblearn.over_sampling module.
2. Creating a BorderlineSMOTE instance, optionally configuring parameters such as which borderline variant to use.
3. Fitting and applying the oversampling strategy to the dataset using the fit_resample method of the BorderlineSMOTE instance. This method returns the resampled feature matrix and the corresponding resampled target vector, effectively balancing the class distribution.
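For example (X and y as in the earlier examples):

```python
from imblearn.over_sampling import BorderlineSMOTE

# kind selects the variant: "borderline-1" or "borderline-2"
bsm = BorderlineSMOTE(kind="borderline-1", random_state=0)
X_resampled, y_resampled = bsm.fit_resample(X, y)
```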
Figure 11: Borderline Synthetic Minority Oversampling
4 Combination of Techniques
4.1 Combination of Oversampling and Undersampling
Although using either an oversampling or an undersampling technique alone can be effective on a training dataset, research has shown that combining both types often results in better model performance on the transformed dataset.
Some of the most widely used and implemented combinations of data sampling
methods include:
• SMOTE and Random Undersampling
• SMOTE and Tomek Links: This approach uses SMOTE to oversample the
minority class, followed by the deletion of Tomek Links to clean the dataset and
refine class boundaries.
• SMOTE and Edited Nearest Neighbors (ENN) Rule: This method applies SMOTE, then removes examples misclassified by a k-NN model (ENN), further cleaning the dataset and enhancing its quality.
Combining SMOTE with these undersampling techniques helps balance the dataset
and reduces noise, leading to improved model performance.
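Two of these combinations are available directly in the imbalanced-learn library; a minimal sketch (X and y as in the earlier examples):

```python
from imblearn.combine import SMOTEENN, SMOTETomek

# SMOTE oversampling followed by Tomek links cleaning
smt = SMOTETomek(random_state=0)
X_st, y_st = smt.fit_resample(X, y)

# SMOTE oversampling followed by Edited Nearest Neighbors cleaning
sme = SMOTEENN(random_state=0)
X_se, y_se = sme.fit_resample(X, y)
```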
4.2 Combination of K-Means and SMOTE
This method employs the simple and popular k-means clustering algorithm in conjunction with SMOTE oversampling in order to rebalance skewed datasets. It avoids the generation of noise by oversampling only in safe areas. Moreover, it addresses both between-class imbalance and within-class imbalance, combating the small disjuncts problem by inflating sparse minority areas. The method is easy to implement due to its simplicity and the widespread availability of both k-means and SMOTE. It is uniquely different from related methods not only because of its low complexity but also because of its effective approach to distributing synthetic samples based on cluster density.
Figure 12: Algorithm explaining the combination of both K-means and SMOTE techniques
4.2.1 Pseudo-Code
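As a substitute sketch of the combined procedure, the KMeansSMOTE class from the imbalanced-learn library implements this method (X and y as in the earlier examples; the threshold value shown is illustrative):

```python
from imblearn.over_sampling import KMeansSMOTE

# Clusters the input space with k-means, applies SMOTE only inside clusters
# with a sufficient share of minority samples ("safe areas"), and allocates
# more synthetic samples to sparser clusters.
kms = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_resampled, y_resampled = kms.fit_resample(X, y)
```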
Part II
Evaluation Metrics for Imbalanced Classification
1 Introduction
Classification accuracy is a metric that summarizes the performance of a classification
model as the number of correct predictions divided by the total number of predictions.
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$
Because plain accuracy can be misleading on imbalanced data, a weighted, per-class variant is often used:
$$\text{Weighted Balanced Accuracy} = \sum_{i=1}^{C} w_i \times \text{Accuracy}_i$$
where $C$ is the total number of classes, $w_i$ is the weight for class $i$, and $\text{Accuracy}_i$ is the accuracy of class $i$.
In most imbalanced data use cases, the rare class is more important. One common
formulation for selecting class weights is the Normalized Inverse Class Frequency:
$$w_i = \frac{1}{f_i \times \sum_{j=1}^{C} f_j}$$
where $f_i$ is the frequency of class $i$.
The Weighted Balanced Accuracy ranges from 0 to 1, with 1 representing optimal
performance and 0 representing the worst performance.
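A small worked sketch of the computation; the per-class accuracies and weights are made-up illustrative numbers:

```python
import numpy as np

# Hypothetical per-class accuracies for C = 2 classes (majority, minority)
acc = np.array([0.95, 0.60])
# Hypothetical normalized weights emphasizing the rare class
w = np.array([0.2, 0.8])

weighted_balanced_accuracy = float(np.sum(w * acc))
print(weighted_balanced_accuracy)  # 0.2*0.95 + 0.8*0.60 = 0.67
```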
The F-beta score weighs Recall against Precision:
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \times \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
A larger value of beta gives more weight to Recall, while a smaller value of beta gives more weight to Precision. The F-beta score reaches its optimal value at 1 and its worst value at 0. Generally, if the positive/rare class is more important, then we use PR-AUC in combination with the F1/F2/F0.5/pF scores (a usage sketch follows the list below):
• If both False Negatives and False Positives are equally important, then we use the F1-Score.
• If False Positives are more important than False Negatives, then we use the F0.5-Score.
• If False Negatives are more important than False Positives, then we use the F2-Score.
• If the output is measured via probabilities, then we use pF1/pF2/pF0.5 as per the scenarios above.
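A minimal sketch using scikit-learn's fbeta_score; the labels and predictions are illustrative:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

print(f1_score(y_true, y_pred))               # beta = 1: balanced
print(fbeta_score(y_true, y_pred, beta=0.5))  # favors Precision
print(fbeta_score(y_true, y_pred, beta=2))    # favors Recall
```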
The False Positive Rate (FPR) in the context of the ROC (Receiver Operating Characteristic) curve is defined as:
$$\text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}$$
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering
the classification threshold classifies more items as positive, thus increasing both False
Positives and True Positives. For different threshold values we will get different TPR
and FPR. So, in order to visualise which threshold is best suited for the classifier we
plot the ROC curve.
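A minimal sketch of computing the points of an ROC curve with scikit-learn; y_true and y_scores are illustrative labels and predicted scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_true, y_scores))              # area under the ROC curve
```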
When the score distributions of the two classes overlap, we introduce type 1 and type 2 errors, which we can minimize or trade off by moving the threshold. When the AUC is 0.7, there is a 70% chance that the model will rank a randomly chosen positive example above a randomly chosen negative one, i.e., be able to distinguish between the positive class and the negative class.
When the AUC is approximately 0.5, the model has no capacity to discriminate between the positive class and the negative class; this is the worst situation. When the AUC is approximately 0, the model is actually reciprocating the classes: it predicts the negative class as positive and vice versa.