
Balancing Techniques &

Evaluation Metrics for


Imbalanced Data

HOUSNI Souhail NMILI Ikram

Advisor: Dr Mohamed Hosni


June 20, 2024

1
Contents

I Balancing Techniques 4
1 Introduction 4

2 Undersampling 4
2.1 Random-Undersampling . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Technique introduction . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.4 Library and Execution example . . . . . . . . . . . . . . . . . . 6
2.2 Near miss Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Pseudo code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.3 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Tomek Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Technique Introduction and History . . . . . . . . . . . . . . . 7
2.3.2 Pseudo-code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 KNN Order (K-nearest Neighbor Order) . . . . . . . . . . . . . . . . 9
2.4.1 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 CNN (Condensed Nearest Neighbor Method) . . . . . . . . . . . . . . 11
2.5.1 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.3 Library and Execution Example . . . . . . . . . . . . . . . . . 12
2.6 Cluster Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6.1 Technique Introduction and History . . . . . . . . . . . . . . . 12
2.6.2 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.3 Library and Execution example . . . . . . . . . . . . . . . . . . 13

3 OverSampling 14
3.1 Random Oversampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Technique introduction and history . . . . . . . . . . . . . . . 14
3.1.2 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.3 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.4 Library and Execution example . . . . . . . . . . . . . . . . . . 15
3.2 Synthetic Minority Oversampling . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Technique introduction and history . . . . . . . . . . . . . . . 15
3.2.2 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.3 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.4 Library and Execution example . . . . . . . . . . . . . . . . . . 16
3.3 Adaptive Synthetic Sampling . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 Technique introduction and history . . . . . . . . . . . . . . . 17
3.3.2 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.3 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.4 Library and Execution example . . . . . . . . . . . . . . . . . . 19
3.4 Borderline Synthetic Minority Oversampling . . . . . . . . . . . . . . . 19
3.4.1 Technique introduction and history . . . . . . . . . . . . . . . 19
3.4.2 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.3 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.4 Library and Execution example . . . . . . . . . . . . . . . . . . 20

4 Combination of Techniques 22
4.1 Combination of Oversampling and UnderSampling . . . . . . . . . . . 22
4.2 Combination of K-Means and SMOTE . . . . . . . . . . . . . . . . . . 23
4.2.1 Pseudo-Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.2 Library and Execution example . . . . . . . . . . . . . . . . . . 24

II Evaluation Metrics for imbalanced classification 25


1 Introduction 25

2 Metrics used in imbalanced learning: 25


2.1 Precision-Recall: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Weighted Balanced Accuracy . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 F-beta Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 ROC Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

List of Figures
1 Fraud detection-Imbalanced data . . . . . . . . . . . . . . . . . . . . . 4
2 Execution example of RUS . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Execution example of Near-miss . . . . . . . . . . . . . . . . . . . . . . 8
4 Tomek Links figure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5 Execution Example of CNN . . . . . . . . . . . . . . . . . . . . . . . 12
6 Cluster Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
7 Execution example of Cluster Centroid for Undersampling . . . . . . . 13
8 Random OverSampling Python execution. . . . . . . . . . . . . . . . 15
9 Synthetic Minority Oversampling Technique Python execution. . . . . 17
10 Adaptive Synthetic Sampling Python execution. . . . . . . . . . . . . 19
11 Borderline Synthetic Minority Oversampling . . . . . . . . . . . . . . . 21
12 Algorithm explaining the combination of both K-means and SMOTE
techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
13 Implementation of the K-means SMOTE combination . . . . . . . . . . 24
14 ROC Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
15 Example of AUC=100% . . . . . . . . . . . . . . . . . . . . . . . . . . 27
16 Example of AUC=70% . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
17 Example of AUC=50% . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
18 Example of AUC=0% . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Part I
Balancing Techniques
1 Introduction
Real-world datasets commonly have one class under-represented relative to the other classes. This imbalance gives rise to the "class imbalance" problem (Prati et al., 2009), also called the "curse of imbalanced datasets": the problem of learning a concept from the class that has only a small number of samples. The class imbalance problem has been encountered in many areas, such as telecommunications management, bioinformatics, fraud detection, and medical diagnosis, and has been considered one of the top 10 problems in data mining and pattern recognition (Yang and Wu, 2006; Rastgoo et al., 2016).

Figure 1: Fraud detection - imbalanced data

Imbalanced data refers to classification problems, both binary and multi-class, in which the classes are not represented equally. In the multi-class case, the class with more samples is called the majority class and the class with fewer samples the minority class. In the binary case, the class with more samples is usually called the negative class and the class with fewer samples the positive class.

2 Undersampling
Undersampling is one of several methods for dealing with imbalanced data. The idea is to balance the data by discarding some of the over-represented samples so that each class ends up with roughly the same number of examples. Undersampling is quite fast when applied to imbalanced data because many majority-class examples are simply ignored. The downside of the method is that some useful information may lie within those ignored examples.

2.1 Random-Undersampling
2.1.1 Technique introduction
Random undersampling removes samples of the majority class at random until their number matches the minority class (He & Ma, 2013). This method is easy to implement and fast to execute, which makes it suitable for very large and complex datasets. A downside of removing samples at random is that some of the more important samples may end up being removed. The method can be used for both binary and multi-class classification problems.

2.1.2 Pseudo-Code
Inputs:
• X_train: feature matrix of the training set
• y_train: corresponding labels of the training set
• ratio: ratio of majority-class samples to keep (e.g., 1.0 keeps all, 0.5 keeps half)
Output:
• X_resampled: resampled feature matrix
• y_resampled: corresponding resampled labels
1. Count the number of samples in the majority class (N_majority).
2. Calculate the number of majority-class samples to keep after undersampling: N_keep = round(ratio × N_majority).
3. Initialize an empty list to store the indices of the samples to keep.
4. For each unique class label c in y_train:
   (a) If c is the minority class: add all indices of the minority-class samples to the list of indices to keep.
   (b) Else (c is the majority class): randomly select N_keep indices from the majority-class samples and add them to the list of indices to keep.
5. Extract the corresponding feature matrix and labels using the selected indices.
6. Return the resampled feature matrix (X_resampled) and corresponding labels (y_resampled).
7. Stop the algorithm.

2.1.3 Complexity

1. Counting majority-class samples: O(n)
2. Calculating the number of samples to keep: O(1)
3. Initializing the empty list: O(1)
4. Looping through the unique class labels: O(n)
5. Adding minority-class indices: O(n_min)
6. Randomly selecting majority-class indices: O(N_keep)
7. Extracting features and labels: O(N_keep)
8. Returning the resampled data: O(1)

Overall complexity: O(n + n_min + N_keep), where n is the total number of samples, n_min is the number of minority-class samples, and N_keep is the number of majority-class samples kept after undersampling.

2.1.4 Library and Execution example
To perform random undersampling in Python, we can use the RandomUnderSampler class from the imbalanced-learn library.

Figure 2: Execution example of RUS
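Since the execution example is reproduced only as an image, a minimal sketch of the same usage is given below; the dataset built with make_classification and the parameter values are purely illustrative.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: roughly 90% majority / 10% minority.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))          # approximately 900 negatives vs 100 positives

# Randomly drop majority samples until both classes have the same size.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))      # both classes now have the minority count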

2.2 Near miss Methods


There are three different versions in the NearMiss family. NearMiss-1 keeps the majority-class examples whose average distance to the three closest minority-class examples is smallest. NearMiss-2 works similarly, but instead keeps the majority-class examples whose average distance to the three farthest minority-class examples is smallest. In the NearMiss-3 method, a given number of the closest majority examples are selected for each minority example. The strength of NearMiss-3 is that it guarantees that every minority example is surrounded by some majority samples.

2.2.1 Pseudo code


1. Initialize the NearMiss algorithm with the desired version (NearMiss-1, NearMiss-
2, NearMiss-3).
2. For each sample in the minority class:

• If using NearMiss-1:
– Compute the distance to the N closest samples in the majority class.
– Keep track of the N closest distances.
• If using NearMiss-2:
– Compute the distance to the N farthest samples in the majority class.
– Keep track of the N farthest distances.
• If using NearMiss-3:
– Keep the M nearest neighbors for each negative sample.
– For each negative sample, compute the average distance to its M nearest
neighbors.
– Select positive samples based on the average distance to the N nearest
neighbors.
3. Select the majority class samples based on the computed distances to achieve
class balance.

2.2.2 Complexity

Complexity Analysis
• Definitions:
  – n_min: number of minority-class samples.
  – n_maj: number of majority-class samples.
  – N: number of nearest/farthest neighbors to consider.
  – M: number of nearest neighbors kept for each majority-class sample in NearMiss-3.
• NearMiss-1:
  – Compute the distance to all majority-class samples: O(n_maj)
  – Sort the distances: O(n_maj log n_maj)
  – Select the N closest samples: O(N)
  – Total for one minority sample: O(n_maj + n_maj log n_maj + N) ≈ O(n_maj log n_maj)
  – Total for all minority samples: n_min × O(n_maj log n_maj) = O(n_min n_maj log n_maj)
• NearMiss-2:
  – Compute the distance to all majority-class samples: O(n_maj)
  – Sort the distances: O(n_maj log n_maj)
  – Select the N farthest samples: O(N)
  – Total for one minority sample: O(n_maj + n_maj log n_maj + N) ≈ O(n_maj log n_maj)
  – Total for all minority samples: n_min × O(n_maj log n_maj) = O(n_min n_maj log n_maj)
• NearMiss-3:
  – For each majority-class sample, compute the distance to all other majority-class samples: O(n_maj^2)
  – Sort the distances and select the M nearest neighbors: O(n_maj^2 log n_maj)
  – Compute the average distance to the M nearest neighbors for each majority sample: O(n_maj M)
  – Sort the majority samples by average distance: O(n_maj log n_maj)
  – Select the majority samples with the smallest average distance to the N nearest neighbors: O(N)
  – Total for NearMiss-3: O(n_maj^2 + n_maj^2 log n_maj + n_maj M + n_maj log n_maj + N) ≈ O(n_maj^2 log n_maj)

Summary of Complexities
• NearMiss-1 and NearMiss-2: O(n_min n_maj log n_maj)
• NearMiss-3: O(n_maj^2 log n_maj)

2.2.3 Library
To perform the NearMiss undersampling techniques in Python, we can use the NearMiss class from the imbalanced-learn library.
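A minimal sketch of this usage on an illustrative synthetic dataset; the version argument selects NearMiss-1, -2 or -3, and n_neighbors roughly plays the role of N in the complexity analysis above.

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# version=1, 2 or 3 selects the corresponding NearMiss variant.
nm = NearMiss(version=1, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)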

2.3 Tomek Links


2.3.1 Technique Introduction and History
Tomek Links is a modification of the Condensed Nearest Neighbors (CNN, not to be confused with Convolutional Neural Networks) undersampling technique and was developed by Tomek (1976). Unlike the CNN method, which selects the samples to remove from the majority class somewhat arbitrarily through its nearest-neighbor rule, the Tomek Links method selects pairs of observations (say, a and b) that fulfil the following properties:

Figure 3: Execution example of Near-miss

• The observation a's nearest neighbor is b.
• The observation b's nearest neighbor is a.
• Observations a and b belong to different classes; that is, a belongs to the minority class and b to the majority class (or vice versa).

Figure 4: Tomek Links figure

Mathematically, it can be expressed as follows. Let d(x_i, x_j) denote the Euclidean distance between x_i and x_j, where x_i is a sample belonging to the minority class and x_j a sample belonging to the majority class. If there is no sample x_k that satisfies d(x_i, x_k) < d(x_i, x_j) or d(x_j, x_k) < d(x_i, x_j), then the pair (x_i, x_j) is a Tomek Link.

2.3.2 Pseudo-code
The pseudo code for the Tomek Links undersampling algorithm can be outlined as
follows:

Algorithm 1 Finding Tomek Links
1: Input: dataset S
2: Output: set of Tomek Links
3: for i = 1 to |S| do
4:   for j = 1 to |S| do
5:     if i ≠ j and y_i ≠ y_j then
6:       Calculate the Euclidean distance d_ij between x_i and x_j
7:       tomek ← true
8:       for all k ∈ S do
9:         if y_i = y_k or y_j = y_k then
10:          if d_ik < d_ij or d_jk < d_ij then
11:            tomek ← false
12:            Break inner loop
13:          end if
14:        end if
15:      end for
16:      if tomek is true then
17:        Add (x_i, x_j) to the set of Tomek Links
18:      end if
19:    end if
20:  end for
21: end for
22: return the set of Tomek Links

This step helps in cleaning the dataset by eliminating ambiguous or noisy instances
near the class boundary.

2.3.3 Complexity
Let N be the number of samples, M the number of samples of the other class, and K the number of Tomek Links found.
Identifying Tomek Links: this step has a time complexity between O(N log M), when an efficient nearest-neighbor search structure is used, and O(N M) for a brute-force search.
Removing Tomek Links: this step is proportional to the number of Tomek Links found, i.e. O(K) in the worst case.
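The report does not give a library example for this technique, but imbalanced-learn exposes it as the TomekLinks class; a minimal sketch on an illustrative synthetic dataset:

from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# Remove the majority-class member of every Tomek Link to clean the class boundary.
tl = TomekLinks(sampling_strategy="majority")
X_res, y_res = tl.fit_resample(X, y)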

2.4 KNN Order (K-nearest Neighbor Order)


Many other undersampling methods are based on the KNN algorithm, e.g. KNN-Und (KNN Undersampling), CNN (Condensed Nearest Neighbor), OSS (One-Sided Selection), and so on. Generally, heuristic undersampling methods, also called focused or informed methods, unlike random undersampling try to reject the least significant examples of the majority class and thus minimize the risk of losing important information. In this paper we describe the KNN-Order undersampling method, which is based on KNN.

2.4.1 Pseudo-Code

Algorithm 2 KNN-Order Algorithm
1: Input:
2:   S_maj: the set of majority-class examples, containing l examples.
3:   P: the percentage of undersampling.
4:   k: the number of neighbors.
5: Output:
6:   S_maj − R   // the reduced subset of the majority class
7: BEGIN
8: exToRemove ← matrix(nrow = l × k, ncol = 2)
9: for i = 1 to l do
10:   Find the k nearest neighbors of the i-th element of S_maj. Save the indexes of the neighbors and the distances between them and the analyzed example in the subsequent rows of the exToRemove matrix.
11: end for
12: exToRemove ← exToRemove[order(exToRemove[, 2]), ]   // sort by ascending distance
13: exToRemove ← exToRemove[!duplicated(exToRemove[, 1]), ]   // remove repeated indexes
14: Z ← P × l   // the number of examples to be removed from S_maj
15: if nrow(exToRemove) ≥ Z then
16:   R ← exToRemove[1:Z, 1]
17: else
18:   R ← exToRemove[, 1]
19: end if
20: return S_maj − R
21: END

2.4.2 Complexity
Finding Nearest Neighbors:
• For each instance in the majority class (of size l), the algorithm finds its k nearest
neighbors.
• This operation is performed l times.
• Retrieving the k nearest neighbors of a single instance is treated here as O(k), assuming a precomputed neighbor index; a brute-force search would add roughly an O(l) factor per instance.
Sorting:
• After collecting the nearest neighbors for all instances in the majority class, the
algorithm sorts them based on ascending distance.
• Sorting l × k neighbor-distance pairs has a complexity of O(l × k log(l × k)).
Post-processing:
• The algorithm performs additional operations like removing duplicate indexes
and determining the number of examples to be removed (Z).
• These operations typically have linear complexity and can be considered as O(l×
k).
Overall Complexity:
• Considering the complexities of the main operations (finding nearest neighbors,
sorting, and post-processing), the overall complexity of the algorithm can be
summarized as follows:
– Finding Nearest Neighbors: O(l × k)
– Sorting: O(l × k log(l × k))
– Post-processing: O(l × k)
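This method is not available in imbalanced-learn, so the sketch below implements the pseudocode above directly with scikit-learn's NearestNeighbors. The function name knn_order_undersample and the exact choice of which neighbors to count are illustrative assumptions rather than a reference implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_order_undersample(X_maj, p, k):
    """Illustrative KNN-Order sketch: drop the fraction p of majority samples
    that appear as the closest neighbors of other majority samples."""
    l = X_maj.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_maj)
    dist, idx = nn.kneighbors(X_maj)              # column 0 is the sample itself
    pairs = np.column_stack([idx[:, 1:].ravel(), dist[:, 1:].ravel()])
    pairs = pairs[np.argsort(pairs[:, 1])]        # sort by ascending distance
    seen, ordered = set(), []
    for j in pairs[:, 0].astype(int):             # drop repeated indexes, keep order
        if j not in seen:
            seen.add(j)
            ordered.append(j)
    z = int(round(p * l))                         # number of examples to remove
    to_remove = np.array(ordered[:z], dtype=int)
    keep = np.setdiff1d(np.arange(l), to_remove)
    return X_maj[keep]

# Example: remove 30% of a random majority cloud using 5 neighbors.
X_maj = np.random.RandomState(0).randn(200, 2)
X_reduced = knn_order_undersample(X_maj, p=0.3, k=5)
print(X_maj.shape, X_reduced.shape)   # (200, 2) -> roughly (140, 2)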

2.5 CNN (Condensed Nearest Neighbor Method)
The Condensed Nearest Neighbor (CNN) method is a data reduction technique used in the k-Nearest Neighbors (k-NN) classification algorithm. CNN reduces the dataset for k-NN classification by selecting a subset of prototypes from the training data that can classify examples almost as accurately as 1-NN does with the whole dataset. The algorithm works iteratively by scanning all elements of the training set, looking for an element whose nearest prototype in the prototype set has a different label than the element itself. Such an element is then removed from the training set and added to the set of prototypes. This process is repeated until no more prototypes are added to the set. The examples that are not prototypes are called "absorbed" points.

2.5.1 Pseudo-Code

Algorithm 3 Condensed Nearest Neighbor (CNN) Algorithm
Require: S: the original dataset with imbalanced class distribution.
Require: S_maj: subset of S containing examples from the majority class.
Require: S_min: subset of S containing examples from the minority class.
Ensure: S′: condensed subset of S_maj.
1: Initialize S′ as an empty set.
2: while examples can be added to S′ without misclassification do
3:   Randomly select an example x from S_maj and add it to S′.
4:   Train a 1-nearest-neighbor (1-NN) classifier on S′.
5:   for each example x_i in S_maj do
6:     if x_i is misclassified by the 1-NN classifier then
7:       Add x_i to S′.
8:     end if
9:   end for
10: end while
11: return S′ as the condensed subset of S_maj.

2.5.2 Complexity
The complexity of the Condensed Nearest Neighbor (CNN) algorithm can be summa-
rized as follows:

1. Initialization: This step involves initializing an empty set S ′ , which has a


constant time complexity of O(1).
2. Iteration: The algorithm iterates until no more examples can be added to S ′
without misclassification. During each iteration, it randomly selects an example
from the majority class, adds it to S ′ , and trains a 1-nearest neighbor (1-NN)
classifier on S ′ . Then, for each example in the majority class, it checks whether
it is misclassified by the 1-NN classifier and adds it to S ′ if necessary. The
number of iterations depends on the dataset and the classification boundaries,
so it is typically represented as O(n), where n is the number of examples in the
majority class.

3. Return: Finally, the algorithm returns S ′ as the condensed subset of Smaj ,


which has a constant time complexity of O(1).

Overall, the complexity of the CNN algorithm can be considered to be linear with
respect to the number of examples in the majority class, O(n), where n is the number
of examples in the majority class.

Figure 5: Execution Example of CNN

2.5.3 Library and Execution Example
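This subsection of the report contains only the figure above; for completeness, imbalanced-learn implements the technique as the CondensedNearestNeighbour class (note the spelling), sketched here on an illustrative synthetic dataset.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# Build a condensed prototype set of the majority class around the minority class.
cnn = CondensedNearestNeighbour(random_state=0)
X_res, y_res = cnn.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))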

2.6 Cluster Centroids


2.6.1 Technique Introduction and History
The idea here is to remove the unimportant instances from the majority class. To decide whether an instance is important or not, we use clustering on the geometry of the feature space. Clustering is an unsupervised learning approach in which clusters are created around groups of related data points. Here it is used only to find the cluster centroids, which are obtained by averaging the feature vectors of the data points in each cluster.
After finding the cluster centroid of the majority class, we decide the following:
• The majority-class instance that is farthest from the cluster centroid in the feature space is considered the most unimportant instance.
• The majority-class instance that is nearest to the cluster centroid in the feature space is considered the most important instance.

Figure 6: Cluster Centroids

2.6.2 Pseudo-Code

Algorithm 4 Cluster Centroids Undersampling Algorithm
Require: S: the original dataset with imbalanced class distribution.
Require: S_maj: subset of S containing examples from the majority class.
Require: S_min: subset of S containing examples from the minority class.
Ensure: S′: undersampled dataset.
1: Initialize S′ as an empty set.
2: Perform clustering on S_maj to obtain K clusters.
3: Calculate the centroid of each cluster.
4: for each centroid c in the set of centroids do
5:   Find the nearest example x in S_maj to c.
6:   Add x to S′.
7: end for
8: return S′ as the undersampled dataset.

2.6.3 Library and Execution example

Figure 7: Execution example of Cluster Centroid for Undersampling
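The figure above is an execution screenshot; a minimal sketch of the corresponding imbalanced-learn usage is shown below. Passing a KMeans estimator explicitly is optional here and only marks where the clustering step of the pseudocode happens.

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# Replace the majority class with the centroids of K-means clusters fitted on it.
cc = ClusterCentroids(estimator=KMeans(n_init=10, random_state=0), random_state=0)
X_res, y_res = cc.fit_resample(X, y)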

3 OverSampling
Oversampling is a technique used in machine learning and data analysis to address
the issue of imbalanced datasets, where one class significantly outnumbers the other.
This imbalance can lead to biased models that perform poorly on the minority class.
Oversampling involves increasing the number of instances in the minority class to
balance the dataset. This can be done through various methods, including random
oversampling, where instances are duplicated, and synthetic oversampling, where new
instances are generated. Oversampling is used in various fields to address the issue
of imbalanced datasets, including medical diagnosis for improving disease detection
models, fraud detection in financial services to enhance the model’s ability to identify
fraudulent transactions, social network analysis to better predict friendships, customer
churn prediction in customer relationship management to improve customer retention
strategies, and speech recognition to improve the model’s ability to recognize rare
words.

3.1 Random Oversampling


3.1.1 Technique introduction and history
Random oversampling is a method to balance a dataset by increasing the number of
instances in the minority class. It involves randomly duplicating instances from the
minority class to match the number of instances in the majority class. This technique
is particularly useful in classification problems where the dataset is imbalanced.

3.1.2 Pseudo-Code

Algorithm 5 Random Oversampling Algorithm
1: Input: feature matrix X, labels y, ratio (desired size of the minority class relative to the majority class)
2: Output: resampled feature matrix X_resampled and labels y_resampled
3: Step 1: Count the number of samples in the majority class (N_majority) and in the minority class (N_minority).
4: Step 2: Calculate the number of minority-class samples to add: N_add = round(ratio × N_majority) − N_minority.
5: Step 3: Initialize the list of indices to keep with all indices of X.
6: Step 4: Randomly draw N_add indices from the minority-class samples, with replacement, and append them to the list of indices to keep.
7: Step 5: Extract the corresponding feature matrix and labels using the selected (and duplicated) indices.
8: Step 6: Return the resampled feature matrix X_resampled and corresponding labels y_resampled.
9: Stop algorithm.

3.1.3 Complexity
The time complexity of random oversampling is primarily influenced by the process
of duplicating examples from the minority class. If we consider the operation of
duplicating a single example as a constant time operation, the time complexity of
oversampling a dataset with n examples, where n is the number of examples in the

minority class, would be O(n). This is because for each example in the minority class,
we perform a constant time operation to duplicate it.

3.1.4 Library and Execution example


To implement random oversampling in Python, we can utilize the RandomOverSampler
class from the imbalanced-learn library. This process involves several steps:

1. Importing necessary libraries, such as RandomOverSampler from imbalanced-learn


and any other required libraries for handling the dataset.
2. Defining the dataset, which can be done using synthetic data generation func-
tions like make_classification from sklearn.datasets for demonstration
purposes.
3. Fitting and applying the oversampling strategy to the dataset using the fit_-
resample method of the RandomOverSampler instance. This method returns
the resampled feature matrix and the corresponding resampled target vector,
effectively balancing the class distribution.

Figure 8: Random OverSampling Python execution.
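As a textual complement to the screenshot above, a minimal sketch of these three steps; the dataset parameters are arbitrary.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# An illustrative imbalanced dataset (step 2 of the list above).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
# Step 3: duplicate randomly chosen minority samples until the classes are balanced.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))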

3.2 Synthetic Minority Oversampling


3.2.1 Technique introduction and history
Synthetic Minority Oversampling Technique (SMOTE) is a method that creates synthetic instances of the minority class. These data points are generated by finding the nearest neighbours of each minority sample and creating new synthetic instances in the feature space until the minority class is balanced to the given ratio. The technique was introduced by Nitesh Chawla and his colleagues in 2002.

3.2.2 Pseudo-Code

Algorithm 6 Synthetic Minority Oversampling Technique (SMOTE)
1: Step 1: Identify the minority-class samples in the dataset.
2: Step 2: Calculate the number of synthetic samples to generate for each minority-class sample: N_synthetic = round(ratio × N_minority).
3: Step 3: Initialize an empty list to store the synthetic samples.
4: for each sample x in the minority class do
5:   Find the k nearest neighbors of x within the same class.
6:   Randomly select one of the k nearest neighbors, x_nn.
7:   Generate N_synthetic synthetic samples in the feature space between x and x_nn and add them to the list of synthetic samples.
8: end for
9: Step 4: Combine the original feature matrix with the synthetic samples.
10: Step 5: Combine the original labels with labels for the synthetic samples.
11: Step 6: Return the resampled feature matrix X_resampled and corresponding labels y_resampled.
12: Stop algorithm.

3.2.3 Complexity
The process of Synthetic Minority Oversampling Technique involves two main steps.
The time complexity of finding nearest neighbors is typically O(n log n) for efficient
nearest neighbor search algorithms like k-d trees or ball trees, where n is the number of
instances in the dataset. Generating synthetic instances is generally considered to have
a constant time complexity, O(1), as it involves a fixed number of operations for each
synthetic instance. Therefore, the overall time complexity of SMOTE is dominated by
the nearest neighbor search, making it O(n log n) for the search operation plus O(1)
for generating each synthetic instance.

3.2.4 Library and Execution example


To implement Synthetic Minority Oversampling Technique in Python, we can utilize
the SMOTE class from the imbalanced-learn library. This process involves several
steps:

1. Importing necessary libraries, such as SMOTE from imbalanced-learn and any


other required libraries for handling the dataset.
2. Defining the dataset, which can be done using synthetic data generation func-
tions like make_classification from sklearn.datasets for demonstration
purposes.
3. Fitting and applying the oversampling strategy to the dataset using the fit_-
resample method of the SMOTE instance. This method returns the resampled
feature matrix and the corresponding resampled target vector, effectively bal-
ancing the class distribution.

Figure 9: Synthetic Minority Oversampling Technique Python execution.
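A minimal sketch corresponding to the screenshot above; k_neighbors=5 is the library default and is written out only to make the link with the pseudocode explicit.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
# Interpolate new minority samples between each minority point and its nearest neighbors.
sm = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))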

3.3 Adaptive Synthetic Sampling


3.3.1 Technique introduction and history
ADASYN (Adaptive Synthetic Sampling) is an extension of the Synthetic Minority
Over-sampling Technique (SMOTE), developed to address some of its limitations. It
was introduced in the paper ”ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning” by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li
in 2008. ADASYN improves upon SMOTE by generating more synthetic samples for
minority class instances that are harder to learn, meaning those that are located in
regions where the density of the minority class is low.

3.3.2 Pseudo-Code
Training dataset D_tr with m samples {(x_i, y_i)}, i = 1, ..., m, where x_i is an instance in the n-dimensional feature space X and y_i ∈ Y = {1, −1} is the class label associated with x_i. Define m_s and m_l as the number of minority-class and majority-class examples, respectively, so that m_s ≤ m_l and m_s + m_l = m.

Algorithm 7 Adaptive Synthetic Sampling Technique (ADASYN)
1: Calculate the degree of class imbalance: d = m_s / m_l, where d ∈ (0, 1].
2: if d < d_th (d_th is a preset threshold for the maximum tolerated degree of class imbalance) then
3:   Calculate the number of synthetic data examples to generate for the minority class: G = (m_l − m_s) × β, where β ∈ [0, 1] specifies the desired balance level after generation; β = 1 means a fully balanced dataset is created after the generation process.
4:   for each example x_i in the minority class do
5:     Find the K nearest neighbors of x_i based on the Euclidean distance in the n-dimensional space.
6:     Calculate the ratio r_i = Δ_i / K, where Δ_i is the number of examples among the K nearest neighbors of x_i that belong to the majority class.
7:     Normalize r_i according to r̂_i = r_i / Σ_{j=1}^{m_s} r_j, so that the r̂_i form a density distribution.
8:     Calculate the number of synthetic data examples to generate for each minority example x_i: ĝ_i = r̂_i × G.
9:     for j = 1 to ⌊ĝ_i⌋ do
10:      Randomly choose one minority data example x_zi from the K nearest neighbors of x_i.
11:      Generate the synthetic data example: s_i = x_i + (x_zi − x_i) × λ, where (x_zi − x_i) is the difference vector in the n-dimensional space and λ ∈ [0, 1] is a random number.
12:    end for
13:  end for
14: end if
15: Combine the original feature matrix with the synthetic samples.
16: Combine the original labels with labels for the synthetic samples.
17: Return the resampled feature matrix X_resampled and corresponding labels y_resampled.


3.3.3 Complexity
To analyze the time complexity of the Adaptive Synthetic Sampling Technique (ADASYN), we break down its main steps. Calculating the degree of class imbalance, checking the threshold condition, and computing the total number of synthetic examples G all take constant time. For each minority-class example, finding the K nearest neighbors requires O(m) time with a brute-force search, where m is the total number of examples, and computing the ratio r_i takes O(K); over all m_s minority examples this gives O(m_s · m). Normalizing the r_i and computing the per-example counts ĝ_i take O(m_s). Generating the synthetic examples takes time proportional to the number of samples generated, at most O(G). Finally, combining the feature matrix and labels takes O(m + G). The overall time complexity is therefore dominated by the neighbor search and is O(m_s · m).

3.3.4 Library and Execution example
To implement the Adaptive Synthetic Sampling technique in Python, we can utilize the ADASYN class from the imbalanced-learn library. This process involves several steps:

1. Importing necessary libraries, such as ADASYN from imbalanced-learn and any other required libraries for handling the dataset.
2. Defining the dataset, which can be done using synthetic data generation functions like make_classification from sklearn.datasets for demonstration purposes.
3. Fitting and applying the oversampling strategy to the dataset using the fit_resample method of the ADASYN instance. This method returns the resampled feature matrix and the corresponding resampled target vector, effectively balancing the class distribution.

Figure 10: Adaptive Synthetic Sampling Python execution.
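A minimal sketch of this usage; n_neighbors corresponds to the K of the pseudocode and the dataset is illustrative.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
# Generate more synthetic samples for minority points surrounded by majority neighbors.
ada = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))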

3.4 Borderline Synthetic Minority Oversampling


3.4.1 Technique introduction and history
Borderline-SMOTE, an extension of the SMOTE algorithm introduced by Han et al. in 2005, addresses dataset imbalance by synthesizing examples focused specifically on the borderline instances of the minority class. Unlike traditional oversampling techniques that uniformly increase the minority class, Borderline-SMOTE identifies the critical instances near the decision boundary and selectively oversamples them. This approach aims to enhance the classifier's ability to learn from challenging instances, potentially leading to improved classification performance.

3.4.2 Pseudo-Code

Algorithm 8 Borderline-SMOTE Algorithm
1: Input: minority-class examples P, majority-class examples N, number of nearest neighbors k, number of synthetic examples to generate s
2: Output: synthetic minority-class examples
3: for p_i ∈ P do
4:   Calculate the m nearest neighbors of p_i from P ∪ N
5:   Calculate the number of majority-class examples among these neighbors, denoted m′
6:   if m = m′ then
7:     Mark p_i as noise and continue
8:   else if m/2 ≤ m′ < m then
9:     Mark p_i as "DANGER"
10:  end if
11: end for
12: Let DANGER be the set of examples marked as "DANGER"
13: for p′_i ∈ DANGER do
14:   Calculate the k nearest neighbors of p′_i from P
15:   Randomly select s of the k nearest neighbors of p′_i
16:   for p_j ∈ the selected nearest neighbors do
17:     Calculate the difference dif_j between p′_i and p_j
18:     Generate a new synthetic minority example between p′_i and p_j:
19:       synthetic = p′_i + dif_j × r_j, where r_j is a random number between 0 and 1
20:   end for
21: end for

3.4.3 Complexity
To analyze the complexity of the Borderline-SMOTE algorithm, we consider its main steps. Firstly, identifying borderline instances involves calculating distances to the m nearest neighbors of each minority-class example, resulting in O((p_num + n_num) · d · log(p_num + n_num)) complexity, where p_num and n_num are the numbers of minority and majority class examples and d is the dimensionality of the feature space. Secondly, marking borderline instances has a time complexity of O(p_num). Finally, generating synthetic examples entails calculating distances to the k nearest neighbors of each borderline instance and creating s synthetic examples, leading to O(p_num · k · s · d) complexity. Thus, the overall complexity is O((p_num + n_num) · d · log(p_num + n_num) + p_num + p_num · k · s · d).

3.4.4 Library and Execution example


To implement Borderline-SMOTE in Python, we can utilize the BorderlineSMOTE
class from the imbalanced-learn library. This process involves several steps:

1. Importing necessary libraries, such as BorderlineSMOTE from imbalanced-learn


and any other required libraries for handling the dataset.
2. Defining the dataset, which can be done using synthetic data generation func-
tions like make_classification from sklearn.datasets for demonstration
purposes.

3. Fitting and applying the oversampling strategy to the dataset using the fit_-
resample method of the BorderlineSMOTE instance. This method returns the
resampled feature matrix and the corresponding resampled target vector, effec-
tively balancing the class distribution.

Figure 11: Borderline Synthetic Minority Oversampling
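A minimal sketch of this usage; kind='borderline-1' generates samples only towards minority neighbors, and m_neighbors plays the role of m in the pseudocode when flagging DANGER points.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
# Oversample only the "DANGER" minority points near the class boundary.
bsm = BorderlineSMOTE(kind="borderline-1", k_neighbors=5, m_neighbors=10, random_state=42)
X_res, y_res = bsm.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))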

4 Combination of Techniques
4.1 Combination of Oversampling and UnderSampling
Although using either an oversampling or undersampling technique alone can be effec-
tive on a training dataset, research has shown that combining both types often results
in better model performance on the transformed dataset.
Some of the most widely used and implemented combinations of data sampling
methods include:
• SMOTE and Random Undersampling

• SMOTE and Tomek Links


• SMOTE and Edited Nearest Neighbors Rule
Let’s take a closer look at these methods.
SMOTE, or Synthetic Minority Over-sampling Technique, is one of the most pop-
ular and widely used oversampling techniques. It is commonly paired with various
undersampling methods.
SMOTE and Random Undersampling: the simplest combination pairs SMOTE oversampling of the minority class with random undersampling of the majority class; this pairing was suggested in the original SMOTE work and reported to outperform using SMOTE alone.
SMOTE with Advanced Undersampling Methods: Often, SMOTE is com-
bined with an undersampling method that selectively removes examples from the
dataset after SMOTE is applied. This editing step targets both the minority and
majority classes, removing noisy points along the class boundary and improving clas-
sifier performance on the transformed dataset.
Two popular combinations include:

• SMOTE and Tomek Links: This approach uses SMOTE to oversample the
minority class, followed by the deletion of Tomek Links to clean the dataset and
refine class boundaries.
• SMOTE and Edited Nearest Neighbors (ENN) Rule: This method ap-
plies SMOTE, then removes examples misclassified by a KNN model (ENN),
further cleaning the dataset and enhancing its quality.
Combining SMOTE with these undersampling techniques helps balance the dataset
and reduces noise, leading to improved model performance.
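The SMOTE + Tomek Links and SMOTE + ENN combinations are packaged directly in imbalanced-learn; a minimal sketch on an illustrative dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
# SMOTE oversampling followed by Tomek-Link cleaning of the class boundary.
X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)
# SMOTE oversampling followed by Edited-Nearest-Neighbours cleaning.
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)
print(Counter(y_st), Counter(y_se))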

4.2 Combination of K-Means and SMOTE
This method proposes to employ the simple and popular k-means clustering algorithm in conjunction with SMOTE oversampling in order to rebalance skewed datasets. It avoids generating noise by oversampling only in safe areas. Moreover, it addresses both between-class imbalance and within-class imbalance, combating the small-disjuncts problem by inflating sparse minority areas. The method is easy to implement thanks to its simplicity and the widespread availability of both k-means and SMOTE. It differs from related methods not only in its low complexity but also in its effective approach of distributing synthetic samples according to cluster density.

Figure 12: Algorithm explaining the combination of both K-means and SMOTE tech-
niques

4.2.1 Pseudo-Code

Algorithm 9 Cluster-Based SMOTE Algorithm
1: Input: dataset X, number of clusters k, imbalance ratio threshold irt, number of nearest neighbors for SMOTE knn, total synthetic examples to generate kn
2: Output: synthetic minority-class examples
3: // Step 1: Cluster the input space and filter the clusters with more minority than majority instances.
4: clusters ← kmeans(X, k)
5: filteredClusters ← ∅
6: for c ∈ clusters do
7:   imbalanceRatio ← (majorityCount(c) + 1) / (minorityCount(c) + 1)
8:   if imbalanceRatio < irt then
9:     filteredClusters ← filteredClusters ∪ {c}
10:  end if
11: end for
12: // Step 2: For each filtered cluster, compute the sampling weight based on its minority density.
13: for f ∈ filteredClusters do
14:   averageMinorityDistance(f) ← mean(euclideanDistances(f))
15:   densityFactor(f) ← minorityCount(f) / averageMinorityDistance(f)
16:   sparsityFactor(f) ← 1 / densityFactor(f)
17: end for
18: sparsitySum ← Σ_{f ∈ filteredClusters} sparsityFactor(f)
19: for f ∈ filteredClusters do
20:   samplingWeight(f) ← sparsityFactor(f) / sparsitySum
21: end for
22: // Step 3: Oversample each filtered cluster using SMOTE. The number of samples to generate is computed from the sampling weight.
23: generatedSamples ← ∅
24: for f ∈ filteredClusters do
25:   numberOfSamples ← ⌊kn × samplingWeight(f)⌋
26:   generatedSamples ← generatedSamples ∪ SMOTE(f, numberOfSamples, knn)
27: end for
28: return generatedSamples

4.2.2 Library and Execution example


The imblearn library provides an easy-to-use implementation for combining k-means
clustering with SMOTE:

Figure 13: Implementation of the K-means SMOTE combination
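Since the figure is an image, a minimal sketch of the KMeansSMOTE usage is given below; the clustering-related parameter values are illustrative and may need adjustment for a given dataset.

from collections import Counter
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
# Cluster the input space, keep minority-dense clusters, then apply SMOTE inside them.
ksm = KMeansSMOTE(
    kmeans_estimator=MiniBatchKMeans(n_clusters=10, n_init=3, random_state=42),
    cluster_balance_threshold=0.1,   # analogous to the imbalance-ratio threshold irt
    random_state=42,
)
X_res, y_res = ksm.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))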

Part II
Evaluation Metrics for imbalanced
classification
1 Introduction
Classification accuracy summarizes the performance of a classification model as the number of correct predictions divided by the total number of predictions:

Accuracy = Correct Predictions / Total Predictions

Achieving 90 percent, or even 99 percent, classification accuracy may be trivial on an imbalanced classification problem. Consider an imbalanced dataset with a 1:100 class imbalance: blindly predicting the majority class already yields a 99% accuracy score. The rule of thumb is that accuracy alone is not a reliable metric on an imbalanced dataset.
In this case, other alternative evaluation metrics can be applied such as:

• Precision: how many selected instances are relevant.
• Recall/Sensitivity: how many relevant instances are selected.
• F1 score: harmonic mean of precision and recall.
• MCC: correlation coefficient between the observed and predicted binary classifications.
• AUC: relation between the true-positive rate and the false-positive rate.

2 Metrics used in imbalanced learning:


2.1 Precision-Recall:
Precision is the ratio of correct positive predictions to all positive predictions. It does not say how many real positive samples are predicted as negative (false negatives), whereas recall provides an indication of missed positive predictions. If the number of negative samples is very large (i.e., an imbalanced dataset), the false positive rate increases more slowly, because the true negatives in the FPR denominator (FP + TN) will be very high and keep that metric small. Precision, however, is not affected by a large number of negative samples: it measures the number of true positives among the samples predicted as positive and is defined as:

Precision = True Positives / (True Positives + False Positives)

Recall, also known as sensitivity or true positive rate, measures the proportion of actual positives that are correctly identified by the model. It is defined as:

Recall = True Positives / (True Positives + False Negatives)
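Both quantities are available in scikit-learn; a small sketch on hypothetical label vectors:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # imbalanced ground truth: 3 positives, 7 negatives
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # one missed positive, one false alarm
print(precision_score(y_true, y_pred))    # 2 / (2 + 1) = 0.666...
print(recall_score(y_true, y_pred))       # 2 / (2 + 1) = 0.666...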

2.2 Weighted Balanced Accuracy


Weighted Balanced Accuracy (WBA) is a thresholded metric that adjusts the accuracy metric according to class weights, where less frequent classes receive higher weight. It can be formulated as a generalized form of Balanced Accuracy (BA):

WeightedBalancedAccuracy = Σ_{i=1}^{C} w_i × Accuracy_i

where C is the total number of classes, w_i is the weight of class i, and Accuracy_i is the accuracy on class i.
In most imbalanced-data use cases, the rare class is more important. A common choice of class weights is the normalized inverse class frequency:

w_i = 1 / (f_i × Σ_{j=1}^{C} f_j)

where f_i is the frequency of class i. The Weighted Balanced Accuracy ranges from 0 to 1, with 1 representing optimal performance and 0 the worst performance.

2.3 F-beta Score


The F-beta score is a robust scoring mechanism for both balanced and unbalanced use cases. It is the generalized form of the commonly used F1 score, which corresponds to β = 1.
F-beta takes into account both precision and recall and computes a weighted harmonic mean of the two.
F-beta is also a thresholded metric, meaning that we first have to apply a threshold to obtain binarized predictions.
Recall that precision is the percentage of correct predictions for the positive class, while recall is the percentage of positive-class samples that are correctly predicted out of all actual positives. The F1 score is the simple harmonic mean of precision and recall; the generalized F-beta score weighs the contribution of precision and recall to the metric:

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)

A smaller value of beta gives more weight to precision, while a larger value of beta gives more weight to recall.
The F-beta score reaches its optimal value at 1 and its worst value at 0. Generally, if the positive/rare class is more important, we use PR-AUC in combination with F1/F2/F0.5/pF scores:
• If false negatives and false positives are equally important, we use the F1 score.
• If false positives are more important than false negatives, we use the F0.5 score.
• If false negatives are more important than false positives, we use the F2 score.
• If the output is measured via probabilities, we use pF1/pF2/pF0.5 according to the scenarios above.
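scikit-learn exposes this directly as fbeta_score; a small sketch on hypothetical label vectors:

from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 3 positives, 7 negatives
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # catches every positive, raises two false alarms
print(f1_score(y_true, y_pred))               # ~0.75  (beta = 1)
print(fbeta_score(y_true, y_pred, beta=2))    # ~0.88  (recall weighted more heavily)
print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.65  (precision weighted more heavily)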

2.4 ROC Curve


A receiver operating characteristic (ROC) curve is a graph showing the performance
of a classification model at all classification thresholds. The ROC curve plots two
parameters, being the True Positive Rate (TPR) and False Positive Rate (FPR). TPR
is also known as sensitivity, and FPR is one minus the specificity or true negative rate.
The True Positive Rate (TPR) corresponds to recall and is calculated as follows:

TPR = Recall = True Positives / (True Positives + False Negatives)

The False Positive Rate (FPR) in the context of the ROC (Receiver Operating Characteristic) curve is defined as:

FPR = False Positives / (False Positives + True Negatives)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering
the classification threshold classifies more items as positive, thus increasing both False
Positives and True Positives. For different threshold values we will get different TPR
and FPR. So, in order to visualise which threshold is best suited for the classifier we
plot the ROC curve.

Figure 14: ROC Curve
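A small sketch of how the curve and the AUC are typically computed with scikit-learn, using hypothetical scores from any probabilistic classifier:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]       # ground-truth labels
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.4,      # predicted probabilities of the positive class
           0.8, 0.7, 0.55, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))               # area under that curve (≈ 0.96 here)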

To compute the points on an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, there is an efficient, sorting-based algorithm that provides this information for us and yields the AUC.
AUC: Area Under the ROC Curve. AUC stands for "Area Under the ROC Curve": it measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0, 0) to (1, 1). An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability; in fact, it is reciprocating the result, predicting 0s as 1s and 1s as 0s. When the AUC is 0.5, the model has no class-separation capacity whatsoever.

Figure 15: Example of AUC=100%

When two distributions overlap, we introduce type 1 and type 2 errors. Depending on the threshold, we can minimize or maximize them. When the AUC is 0.7, there is a 70% chance that the model will be able to distinguish between the positive class and the negative class.

Figure 16: Example of AUC=70%

This is the worst situation: when the AUC is approximately 0.5, the model has no capacity to discriminate between the positive and the negative class.

Figure 17: Example of AUC=50%

When the AUC is approximately 0, the model is actually reciprocating the classes: it predicts the negative class as positive and vice versa.

Figure 18: Example of AUC=0%
