Ads Lab5
AIM: Use the SMOTE technique to generate synthetic data (to solve the problem of class
imbalance). Use any dataset to check the imbalance ratio, perform balancing by random
oversampling and by SMOTE, and compare the accuracy and other evaluation metrics.
THEORY:
Class Imbalance:
In many real-world classification problems, the distribution of classes is often uneven, with
one class significantly outnumbering the others. This class imbalance can lead machine
learning models to be biased towards the majority class, resulting in poor performance on the
minority class. This is a common issue in various domains, such as fraud detection, medical
diagnosis, and rare event prediction.
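As a quick illustration of how an imbalance ratio can be checked before any balancing is done (a minimal sketch; the pandas Series and the 950/50 label split below are assumptions for demonstration only):
import pandas as pd
# Hypothetical binary labels: 950 negatives vs 50 positives
y = pd.Series([0] * 950 + [1] * 50, name="label")
counts = y.value_counts()
imbalance_ratio = counts.max() / counts.min()  # majority-to-minority ratio
print(counts.to_dict())                        # {0: 950, 1: 50}
print(f"Imbalance ratio: {imbalance_ratio}")   # 19.0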
Synthetic Minority Over-sampling Technique (SMOTE):
1. Introduction:
● Objective: SMOTE aims to alleviate the impact of class imbalance by oversampling
the minority class through the generation of synthetic examples.
● Key Idea: Instead of replicating existing minority class instances, SMOTE creates
synthetic samples by interpolating between existing minority class instances.
2. How SMOTE Works:
● Nearest Neighbors: SMOTE operates by selecting a minority class instance and its
k-nearest neighbors.
● Synthetic Sample Creation: A synthetic sample is generated by selecting one of the
k-nearest neighbors and creating a convex combination of the feature values between
the selected instance and that neighbor.
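In other words, if x is the selected minority instance and x_nn is the chosen neighbour, the synthetic point can be written as x_new = x + λ · (x_nn − x), where λ is drawn uniformly at random from [0, 1], so the new point lies on the line segment between the two.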
3. Algorithm Steps:
For each minority class instance:
● Identify its k-nearest neighbors.
● Randomly select one of the neighbors.
● Generate synthetic samples by interpolating between the chosen neighbor and the
original instance.
● Repeat until the desired balance between classes is achieved.
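A minimal sketch of these steps using NumPy and scikit-learn's NearestNeighbors is shown below; it is an illustrative approximation of the algorithm, not the exact implementation used by imbalanced-learn, and the function name smote_sketch is hypothetical:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, seed=42):
    # X_min holds only minority-class rows; n_synthetic is how many points to create
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own nearest neighbour
    _, neighbour_idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))           # pick a minority instance at random
        j = rng.choice(neighbour_idx[i][1:])   # pick one of its k nearest minority neighbours
        lam = rng.random()                     # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.random.rand(20, 2)                  # toy minority-class data
X_new = smote_sketch(X_min, n_synthetic=30, k=3)
print(X_new.shape)                             # (30, 2)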
4. Advantages of SMOTE:
● Mitigating Overfitting: SMOTE helps in reducing overfitting, as it introduces
diversity into the dataset without simply duplicating existing examples.
● Improved Generalization: The synthetic samples contribute to a better generalization
of the model, especially when the available data is limited.
5. Considerations:
● Parameter Tuning: Users need to decide how many synthetic samples to generate,
controlled by parameters such as the oversampling ratio and the number of nearest neighbours k (see the example after this list).
● Impact on Model Interpretability: Introducing synthetic samples may affect the
interpretability of the model, as these examples do not correspond to actual
observations.
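As an example of these tuning knobs (assuming imbalanced-learn's SMOTE class, whose sampling_strategy and k_neighbors arguments control the amount of oversampling and the neighbourhood size; the toy 20/180 split below is an assumption for illustration):
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(200, 4)
y = np.array([1] * 20 + [0] * 180)   # 20 minority vs 180 majority samples

# sampling_strategy=0.5 requests a minority:majority ratio of 0.5 after resampling;
# k_neighbors sets how many minority neighbours are candidates for interpolation
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(np.bincount(y_res))            # [180, 90]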
6. Limitations:
● Sensitive to Noise: SMOTE may amplify noise, since synthetic samples are interpolated from minority instances that may themselves be mislabelled or outlying.
● Potential Overfitting: If not used cautiously, SMOTE might lead to overfitting, especially when the number of synthetic samples is excessive.
In summary, SMOTE is a valuable technique for addressing class imbalance by creating
synthetic samples for the minority class. However, it should be applied carefully,
considering the characteristics of the dataset and its potential impact on model performance
and interpretability.
CODE:
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix)
np.random.seed(42)
X_minority = np.random.rand(100, 20) # Minority class
X_majority = np.random.rand(900, 20) # Majority class
X = np.vstack((X_minority, X_majority))
y = np.hstack((np.ones(100), np.zeros(900))) # Labels (1 for minority, 0 for majority)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
imbalance_ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
print(f"Imbalance ratio before balancing: {imbalance_ratio}")
# Apply SMOTE to the training data only; the test set is left untouched
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
imbalance_ratio_after_smote = np.sum(y_resampled == 0) / np.sum(y_resampled == 1)
print(f"Imbalance ratio after SMOTE: {imbalance_ratio_after_smote}")
# Train one classifier on the original imbalanced data and another on the SMOTE-resampled data
clf_original = RandomForestClassifier(random_state=42)
clf_original.fit(X_train, y_train)
clf_smote = RandomForestClassifier(random_state=42)
clf_smote.fit(X_resampled, y_resampled)
# Evaluate a fitted classifier on the held-out test set
def evaluate_model(clf, X_test, y_test):
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    return accuracy, precision, recall, f1, confusion_mat
accuracy_original, precision_original, recall_original, f1_original, conf_mat_original = evaluate_model(clf_original, X_test, y_test)
accuracy_smote, precision_smote, recall_smote, f1_smote, conf_mat_smote = evaluate_model(clf_smote, X_test, y_test)
print("\nResults on the original test set:")
print(f"Accuracy: {accuracy_original}")
print(f"Precision: {precision_original}")
print(f"Recall: {recall_original}")
print(f"F1 Score: {f1_original}")
print(f"Confusion Matrix:\n{conf_mat_original}")
print("\nResults on the SMOTE-resampled test set:")
print(f"Accuracy: {accuracy_smote}")
print(f"Precision: {precision_smote}")
print(f"Recall: {recall_smote}")
print(f"F1 Score: {f1_smote}")
print(f"Confusion Matrix:\n{conf_mat_smote}")
OUTPUT:
CONCLUSION:
We studied the use of SMOTE to address class imbalance. On this mildly imbalanced synthetic
dataset, balancing with SMOTE produced only negligible changes in the model's accuracy and
other evaluation metrics on the held-out test set.