XGBoost Classification with SMOTE-ENN Algorithm
by Gustiyan Islahuzaman
Introduction
Before embarking on our classification journey, let’s delve into the foundation of data integrity.
Sourced from Kaggle, our dataset is a realm of insights, comprising 21 columns and an
impressive 41,188 rows.
Attribute Information: Within this landscape, we explore a rich tapestry:
1. Input Variables: the remaining 20 columns, covering client attributes along with campaign and economic indicators such as age, campaign, pdays, previous, and euribor3m.
2. Output Variable:
· Subscription (y): a binary indicator of whether the client subscribed to a term deposit (“yes” or “no”).
Our voyage begins by sculpting data quality and embracing its nuances. As we traverse this
domain, the marriage of numbers, categories, and trends shall guide our classification odyssey.
Let’s delve further, where insight and analysis converge in pursuit of precision.
Cleaning Data: Before diving into analysis, we tidy up our dataset. This involves spotting and
fixing errors, handling missing values, and ensuring data consistency.
Standardizing Data: Next, we bring uniformity to our data. Standardization adjusts values to a
common scale, aiding fair comparison across features.
Encoding Data: To bridge the gap between words and numbers, we employ encoding. This
translates categorical data into numerical form, ready for analysis.
With these simple yet powerful steps, we lay the groundwork for extracting insights and patterns from our data.
Setting Up the Workspace: Our journey starts with assembling the essential tools. We’ve enlisted the aid of well-loved Python libraries: pandas and NumPy for data handling, Matplotlib and Seaborn for visualization, scikit-learn for preprocessing and evaluation, and XGBoost for the classifier itself.
Together, these tools empower us to embark on an insightful exploration, unveiling the hidden
stories within our data.
1. Loading Data: Our journey commences with loading the dataset. We bring it into our
workspace, preparing to unveil its hidden stories.
# Import libraries
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load dataset
data_path = ('https://raw.githubusercontent.com/gustiyaniz/'
             'PwCAssigmentClassification/main/bank-additional-full.csv')
df = pd.read_csv(data_path, sep=';', header=0)
# print(df.info())
print('Shape of dataframe:', df.shape)
df.head()
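The data-cleaning pass described in the introduction is not shown as code in this excerpt, so here is a minimal sketch of the usual checks on the freshly loaded dataframe. The check for the literal string 'unknown' reflects an assumption about this bank-marketing dataset, which typically marks missing categorical values that way rather than with NaN.
# Basic data-quality checks: missing values and duplicate rows
print('Missing values per column:')
print(df.isnull().sum())
print('Duplicate rows:', df.duplicated().sum())
# Categorical gaps in this dataset are usually encoded as the string 'unknown' (assumption)
print("Rows containing an 'unknown' value:", (df == 'unknown').any(axis=1).sum())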
2. Checking Data Types: Understanding data is crucial. We examine the data types of various
columns to ensure they align with their content. This step paves the way for accurate analysis.
# Names of the columns
columns = list(df.columns)
columns
# check datatype in each column
print("Column datatypes: ")
print(df.dtypes)
# Copying original dataframe
df_set = df.copy()
3. Removing Unnecessary Columns: Not all columns contribute equally. We streamline our
dataset by removing irrelevant columns, such as ‘duration’, to focus on what truly matters.
# Drop 'duration' column
df = df.drop(columns='duration')
df.head()
With this streamlined approach, we ensure our dataset is primed for meaningful analysis, setting
the stage for insightful discoveries.
Standardizing Data
To standardize the numerical columns in our dataset, we’ll use the StandardScaler from the
sklearn.preprocessing module. This scaler will transform our data to have a mean of 0 and a
standard deviation of 1.
from sklearn.preprocessing import StandardScaler
# Copying original dataframe
df_set = df.copy()
# Standardize the numeric columns
scaler = StandardScaler()
num_cols = ['age', 'campaign', 'pdays', 'previous','emp.var.rate',
'cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
df_set[num_cols] = scaler.fit_transform(df[num_cols])
# Print the shape and the first few rows of the standardized DataFrame
print('Shape of dataframe:', df_set.shape)
df_set.head()
# Select Target
target = df_set['y']  # Target (output variable)
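1. SMOTE-ENN Resampling: Before modelling, we rebalance the training data. The encoding, train/test split, and resampling code is not reproduced in this excerpt, so the snippet below is a minimal sketch, assuming one-hot encoding with pandas get_dummies, an 80/20 stratified split, and the SMOTEENN combination from the imbalanced-learn package; the parameter values and variable names are illustrative rather than the article’s exact code.
from collections import Counter
from imblearn.combine import SMOTEENN
# One-hot encode the categorical inputs and map the target to 0/1
# (the encoding scheme here is an assumption, not taken from the original notebook)
features = pd.get_dummies(df_set.drop(columns='y'))
target_num = target.map({'no': 0, 'yes': 1})
# Hold out a test set before resampling so evaluation uses untouched data
X_train, X_test, y_train, y_test = train_test_split(
    features, target_num, test_size=0.2, stratify=target_num, random_state=42)
# SMOTE oversampling of the minority class followed by ENN cleaning
print('Before', Counter(y_train))
smote_enn = SMOTEENN(random_state=42)
X_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)
print('After', Counter(y_train_res))
Resampling only the training split keeps the test set representative of the true, imbalanced class distribution.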
2. XGBoost Classification
Enter XGBoost, a powerhouse in classification. We harness its might to create a classification
model that learns from the data’s intricate patterns, making accurate predictions.
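The model-training code itself is not reproduced here, so the following is a minimal sketch of fitting an XGBoost classifier on the resampled training data. The hyperparameters are illustrative defaults rather than the article’s tuned values, and the variable names follow the resampling sketch above.
# Train an XGBoost classifier on the SMOTE-ENN resampled training data
# (hyperparameters are illustrative, not the article's tuned values)
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    eval_metric='logloss',
    random_state=42)
model.fit(X_train_res, y_train_res)
# Evaluate on the untouched test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))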
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             recall_score, precision_score, confusion_matrix)
# Compute the evaluation metrics on the test set
# (y_test and y_pred follow the training sketch above)
accuracy = accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
# Plot the metric values for the SMOTE-ENN model
metric_names = ['F1', 'Accuracy', 'Kappa', 'Recall', 'Precision']
values_smote = [f1, accuracy, kappa, recall, precision]
plt.bar(metric_names, values_smote)
plt.ylabel('Score')
plt.show()
Observations:
1. Adding Prediction Labels to the Dataset: Our journey comes full circle as we enrich the
dataset with prediction labels. This addition showcases how our model classifies each
instance, offering a holistic view of the classification outcomes.
import joblib
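The joblib import above points to the final housekeeping: attaching the predicted labels to the dataframe and saving the trained model. A minimal sketch follows; the feature matrix, column name, and file name are illustrative assumptions carried over from the earlier sketches.
# Attach the model's predictions to the dataset as a new column
# ('features' and 'model' follow the earlier sketches; the column and file names are illustrative)
predictions = model.predict(features)
df_set['y_pred'] = pd.Series(predictions, index=df_set.index).map({0: 'no', 1: 'yes'})
df_set.head()
# Persist the trained model for later reuse
joblib.dump(model, 'xgboost_smote_enn_model.pkl')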
As we conclude this phase, we not only preserve our hard-earned results but also imbue our
dataset with newfound knowledge, making it a richer resource for future insights.
Conclusion
In conclusion, XGBoost classification with the SMOTE + ENN algorithm is a powerful tool for improving classification outcomes on imbalanced datasets. Our results demonstrate the importance of data preprocessing and of balancing the uneven class distribution in the training data for accurate classification outcomes. In future work, we can explore the potential applications of XGBoost classification with the SMOTE + ENN algorithm in other domains and datasets.