
Vertopal.com AML Project LearnerNotebook LowCode

Thera bank is facing a decline in credit card users, prompting the need to analyze customer data to identify potential attrition and improve services. The document outlines a classification model to predict customer attrition based on various attributes such as age, gender, income, and card usage. It includes data descriptions, analysis instructions, and necessary libraries for implementation in a Jupyter Notebook.

Uploaded by

marysaranyag
Copyright
© © All Rights Reserved

Problem Statement

Business Context
Thera Bank recently saw a steep decline in the number of users of its credit card. Credit
cards are a good source of income for banks because of the different kinds of fees they charge,
such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign
transaction fees, and others. Some fees are charged to every user irrespective of usage, while
others are charged under specified circumstances.

Customers leaving the credit card service means lost revenue for the bank, so the bank wants to
analyze its customer data, identify the customers who are likely to leave, and understand their
reasons for leaving, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help
the bank improve its services so that customers do not renounce their credit cards.

Data Description
• CLIENTNUM: Client number. Unique identifier for the customer holding the account
• Attrition_Flag: Internal event (customer activity) variable - if the account is closed then
"Attrited Customer" else "Existing Customer"
• Customer_Age: Age in Years
• Gender: Gender of the account holder
• Dependent_count: Number of dependents
• Education_Level: Educational Qualification of the account holder - Graduate, High
School, Unknown, Uneducated, College(refers to college student), Post-Graduate,
Doctorate
• Marital_Status: Marital Status of the account holder
• Income_Category: Annual Income Category of the account holder
• Card_Category: Type of Card
• Months_on_book: Period of relationship with the bank (in months)
• Total_Relationship_Count: Total no. of products held by the customer
• Months_Inactive_12_mon: No. of months inactive in the last 12 months
• Contacts_Count_12_mon: No. of Contacts in the last 12 months
• Credit_Limit: Credit Limit on the Credit Card
• Total_Revolving_Bal: Total Revolving Balance on the Credit Card
• Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
• Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
• Total_Trans_Amt: Total Transaction Amount (Last 12 months)
• Total_Trans_Ct: Total Transaction Count (Last 12 months)
• Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
• Avg_Utilization_Ratio: Average Card Utilization Ratio
What Is a Revolving Balance?
• If we don't pay the balance of the revolving credit account in full every month, the unpaid
portion carries over to the next month. That's called a revolving balance

What is the Average Open to buy?


• 'Open to Buy' means the amount left on your credit card to use. Now, this column
represents the average of this value for the last 12 months.

What is the Average utilization Ratio?


• The Avg_Utilization_Ratio represents how much of the available credit the customer
spent. This is useful for calculating credit scores.

Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:


• ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
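
As a quick sanity check, the relation can be verified with the values of the first customer in this dataset (Credit_Limit = 12691, Avg_Open_To_Buy = 11914, Avg_Utilization_Ratio = 0.061). This is a minimal sketch; the helper name `utilization_identity` is hypothetical and not part of the notebook. The small deviation from 1 appears to come from the ratio being stored with three decimals.

```python
import math


def utilization_identity(avg_open_to_buy, credit_limit, avg_utilization_ratio):
    """Return (Avg_Open_To_Buy / Credit_Limit) + Avg_Utilization_Ratio."""
    return avg_open_to_buy / credit_limit + avg_utilization_ratio


# Values of the first customer (row 0) in the dataset
total = utilization_identity(11914.0, 12691.0, 0.061)

# The sum should be 1 up to the rounding of the stored ratio
print(round(total, 3))
print(math.isclose(total, 1.0, abs_tol=1e-3))
```

For this row, Credit_Limit - Avg_Open_To_Buy = 777, which equals the customer's Total_Revolving_Bal, so the utilization ratio here is simply the revolving balance divided by the credit limit.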

Please read the instructions carefully before starting the project.
This is a commented Jupyter Notebook file in which all the instructions and tasks to be
performed are mentioned.

• Blanks '_______' are provided in the notebook that need to be filled with appropriate
code to get the correct result. With every '_______' blank, there is a comment that briefly
describes what needs to be filled in the blank space.
• Identify the task to be performed correctly, and only then proceed to write the required
code.
• Fill in the code wherever asked by commented lines like "# write your code here" or "#
complete the code". Running incomplete code may throw an error.
• Please run the code cells sequentially from the beginning to avoid any
unnecessary errors.
• Add the results/observations (wherever mentioned) derived from the analysis to the
presentation and submit the same.

Importing necessary libraries


# Installing the libraries with the specified version.
# Uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user

# Installing the libraries with the specified version.
# Uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl

Note: After running the above cell, kindly restart the notebook kernel and run all cells
sequentially from the start again.

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# To suppress scientific notation
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To suppress warnings
import warnings
warnings.filterwarnings("ignore")

Loading the dataset


churn = pd.read_csv("BankChurners.csv")

Data Overview
The initial steps to get an overview of any dataset are to:

• observe the first few rows of the dataset, to check whether the dataset has been loaded
properly or not
• get information about the number of rows and columns in the dataset
• find out the data types of the columns to ensure that data is stored in the preferred
format and the value of each property is as expected.
• check the statistical summary of the dataset to get an overview of the numerical columns
of the data

Checking the shape of the dataset


# Checking the number of rows and columns in the data
churn.shape

(10127, 21)

# let's create a copy of the data
data = churn.copy()

Displaying the first few rows of the dataset


# let's view the first 5 rows of the data
data.head()

   CLIENTNUM     Attrition_Flag  Customer_Age  Gender  Dependent_count  \
0  768805383  Existing Customer            45       M                3
1  818770008  Existing Customer            49       F                5
2  713982108  Existing Customer            51       M                3
3  769911858  Existing Customer            40       F                4
4  709106358  Existing Customer            40       M                3

   Education_Level  Marital_Status  Income_Category  Card_Category  \
0      High School         Married      $60K - $80K           Blue
1         Graduate          Single   Less than $40K           Blue
2         Graduate         Married     $80K - $120K           Blue
3      High School             NaN   Less than $40K           Blue
4       Uneducated         Married      $60K - $80K           Blue

   Months_on_book  Total_Relationship_Count  Months_Inactive_12_mon  \
0              39                         5                       1
1              44                         6                       1
2              36                         4                       1
3              34                         3                       4
4              21                         5                       1

   Contacts_Count_12_mon  Credit_Limit  Total_Revolving_Bal  Avg_Open_To_Buy  \
0                      3     12691.000                  777        11914.000
1                      2      8256.000                  864         7392.000
2                      0      3418.000                    0         3418.000
3                      1      3313.000                 2517          796.000
4                      0      4716.000                    0         4716.000

   Total_Amt_Chng_Q4_Q1  Total_Trans_Amt  Total_Trans_Ct  Total_Ct_Chng_Q4_Q1  \
0                 1.335             1144              42                1.625
1                 1.541             1291              33                3.714
2                 2.594             1887              20                2.333
3                 1.405             1171              20                2.333
4                 2.175              816              28                2.500

   Avg_Utilization_Ratio
0                  0.061
1                  0.105
2                  0.000
3                  0.760
4                  0.000

# let's view the last 5 rows of the data
data.tail()

       CLIENTNUM     Attrition_Flag  Customer_Age  Gender  Dependent_count  \
10122  772366833  Existing Customer            50       M                2
10123  710638233  Attrited Customer            41       M                2
10124  716506083  Attrited Customer            44       F                1
10125  717406983  Attrited Customer            30       M                2
10126  714337233  Attrited Customer            43       F                2

       Education_Level  Marital_Status  Income_Category  Card_Category  \
10122         Graduate          Single      $40K - $60K           Blue
10123              NaN        Divorced      $40K - $60K           Blue
10124      High School         Married   Less than $40K           Blue
10125         Graduate             NaN      $40K - $60K           Blue
10126         Graduate         Married   Less than $40K         Silver

       Months_on_book  Total_Relationship_Count  Months_Inactive_12_mon  \
10122              40                         3                       2
10123              25                         4                       2
10124              36                         5                       3
10125              36                         4                       3
10126              25                         6                       2

       Contacts_Count_12_mon  Credit_Limit  Total_Revolving_Bal  \
10122                      3      4003.000                 1851
10123                      3      4277.000                 2186
10124                      4      5409.000                    0
10125                      3      5281.000                    0
10126                      4     10388.000                 1961

       Avg_Open_To_Buy  Total_Amt_Chng_Q4_Q1  Total_Trans_Amt  Total_Trans_Ct  \
10122         2152.000                 0.703            15476             117
10123         2091.000                 0.804             8764              69
10124         5409.000                 0.819            10291              60
10125         5281.000                 0.535             8395              62
10126         8427.000                 0.703            10294              61

       Total_Ct_Chng_Q4_Q1  Avg_Utilization_Ratio
10122                0.857                  0.462
10123                0.683                  0.511
10124                0.818                  0.000
10125                0.722                  0.000
10126                0.649                  0.189

Checking the data types of the columns for the dataset


# let's check the data types of the columns in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CLIENTNUM 10127 non-null int64
1 Attrition_Flag 10127 non-null object
2 Customer_Age 10127 non-null int64
3 Gender 10127 non-null object
4 Dependent_count 10127 non-null int64
5 Education_Level 8608 non-null object
6 Marital_Status 9378 non-null object
7 Income_Category 10127 non-null object
8 Card_Category 10127 non-null object
9 Months_on_book 10127 non-null int64
10 Total_Relationship_Count 10127 non-null int64
11 Months_Inactive_12_mon 10127 non-null int64
12 Contacts_Count_12_mon 10127 non-null int64
13 Credit_Limit 10127 non-null float64
14 Total_Revolving_Bal 10127 non-null int64
15 Avg_Open_To_Buy 10127 non-null float64
16 Total_Amt_Chng_Q4_Q1 10127 non-null float64
17 Total_Trans_Amt 10127 non-null int64
18 Total_Trans_Ct 10127 non-null int64
19 Total_Ct_Chng_Q4_Q1 10127 non-null float64
20 Avg_Utilization_Ratio 10127 non-null float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
• There are a total of 21 columns and 10,127 observations in the dataset
• Two columns, Education_Level (8,608 non-null) and Marital_Status (9,378 non-null), have
fewer non-null values than there are rows, i.e. these columns have missing values.

Checking for duplicate values


# let's check for duplicate entries in the data
data.duplicated().sum()

Checking for missing values


# let's check for missing entries in the data
data.isnull().sum()

CLIENTNUM 0
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
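
To put these counts in perspective, they can be expressed as a share of all rows. The sketch below only re-uses the numbers printed above (10,127 rows; 1,519 and 749 missing values); the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

n_rows = 10127  # number of rows reported by churn.shape

# Missing-value counts copied from the data.isnull().sum() output
missing = pd.Series({"Education_Level": 1519, "Marital_Status": 749})

# Share of missing values per column, in percent
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct)

# On the real dataframe the same figures could be obtained with:
# data.isnull().mean() * 100
```

Roughly 15% of Education_Level and 7.4% of Marital_Status are missing, which is why these columns are imputed later rather than dropped.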

Statistical summary of the dataset


# let's view the number of unique values in each column
# (data.describe().T would print the statistical summary of the numerical columns instead)
data.nunique()

CLIENTNUM 10127
Attrition_Flag 2
Customer_Age 45
Gender 2
Dependent_count 6
Education_Level 6
Marital_Status 3
Income_Category 6
Card_Category 4
Months_on_book 44
Total_Relationship_Count 6
Months_Inactive_12_mon 7
Contacts_Count_12_mon 7
Credit_Limit 6205
Total_Revolving_Bal 1974
Avg_Open_To_Buy 6813
Total_Amt_Chng_Q4_Q1 1158
Total_Trans_Amt 5033
Total_Trans_Ct 126
Total_Ct_Chng_Q4_Q1 830
Avg_Utilization_Ratio 964
dtype: int64

• Customer_Age has only 45 unique values, i.e. customer ages fall within a limited range
• We have many continuous variables - Customer_Age, Credit_Limit and
Total_Revolving_Bal, for example.
• The remaining variables are categorical or take only a few discrete values (for example,
Dependent_count has 6 unique values)

data.describe(include=["object"]).T

                 count  unique                top  freq
Attrition_Flag   10127       2  Existing Customer  8500
Gender           10127       2                  F  5358
Education_Level   8608       6           Graduate  3128
Marital_Status    9378       3            Married  4687
Income_Category  10127       6     Less than $40K  3561
Card_Category    10127       4               Blue  9436

# Printing the unique values and their counts for each categorical column
for i in data.describe(include=["object"]).columns:
    print("Unique values in", i, "are :")
    print(data[i].value_counts())
    print("*" * 50)

Unique values in Attrition_Flag are :
Attrition_Flag
Existing Customer 8500
Attrited Customer 1627
Name: count, dtype: int64
**************************************************
Unique values in Gender are :
Gender
F 5358
M 4769
Name: count, dtype: int64
**************************************************
Unique values in Education_Level are :
Education_Level
Graduate 3128
High School 2013
Uneducated 1487
College 1013
Post-Graduate 516
Doctorate 451
Name: count, dtype: int64
**************************************************
Unique values in Marital_Status are :
Marital_Status
Married 4687
Single 3943
Divorced 748
Name: count, dtype: int64
**************************************************
Unique values in Income_Category are :
Income_Category
Less than $40K 3561
$40K - $60K 1790
$80K - $120K 1535
$60K - $80K 1402
abc 1112
$120K + 727
Name: count, dtype: int64
**************************************************
Unique values in Card_Category are :
Card_Category
Blue 9436
Silver 555
Gold 116
Platinum 20
Name: count, dtype: int64
**************************************************

# CLIENTNUM consists of unique IDs for clients and hence will not add value to the modeling
data.drop(["CLIENTNUM"], axis=1, inplace=True)

## Encoding Existing and Attrited customers to 0 and 1 respectively, for analysis.
data["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
data["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)
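
The same 0/1 encoding can also be written as a single mapping that is assigned back to the column. The sketch below runs on a small made-up frame (`toy` and `flag_map` are illustrative names, not part of the notebook); assigning the result back avoids relying on an in-place update of a column selection, which newer pandas releases may warn about.

```python
import pandas as pd

# Small made-up frame standing in for the real data
toy = pd.DataFrame(
    {"Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"]}
)

# Explicit label-to-code mapping: 0 = existing, 1 = attrited
flag_map = {"Existing Customer": 0, "Attrited Customer": 1}

# map() returns a new Series, which is assigned back to the column
toy["Attrition_Flag"] = toy["Attrition_Flag"].map(flag_map)
print(toy["Attrition_Flag"].tolist())
```

Any label missing from the mapping would become NaN, so a quick `isna().sum()` check after mapping is a cheap way to catch unexpected values.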
Exploratory Data Analysis
The below functions need to be defined to carry out the Exploratory Data Analysis.
# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

# function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage

    plt.show()  # show the plot

# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate analysis
Customer_Age

histogram_boxplot(data, "Customer_Age", kde=True)

Months_on_book

histogram_boxplot(data, "Months_on_book")

Credit_Limit

histogram_boxplot(data, "Credit_Limit")

Total_Revolving_Bal

histogram_boxplot(data, "Total_Revolving_Bal")

Avg_Open_To_Buy

histogram_boxplot(data, "Avg_Open_To_Buy")

Total_Trans_Ct

histogram_boxplot(data, "Total_Trans_Ct")

Total_Amt_Chng_Q4_Q1

histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1")

Let's see how the total transaction amount is distributed
Total_Trans_Amt

histogram_boxplot(data, "Total_Trans_Amt")

Total_Ct_Chng_Q4_Q1

histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1")

Avg_Utilization_Ratio

histogram_boxplot(data, "Avg_Utilization_Ratio")

Dependent_count

labeled_barplot(data, "Dependent_count")

Total_Relationship_Count

labeled_barplot(data, "Total_Relationship_Count")

Months_Inactive_12_mon

labeled_barplot(data, "Months_Inactive_12_mon")

Contacts_Count_12_mon

labeled_barplot(data, "Contacts_Count_12_mon")

Gender

labeled_barplot(data, "Gender")

Let's see the distribution of the level of education of customers
Education_Level

labeled_barplot(data, "Education_Level")

Marital_Status

labeled_barplot(data, "Marital_Status")

Let's see the distribution of the level of income of customers
Income_Category

labeled_barplot(data, "Income_Category")

Card_Category

labeled_barplot(data, "Card_Category")

Attrition_Flag

labeled_barplot(data, "Attrition_Flag")

# creating histograms
data.hist(figsize=(14, 14))
plt.show()

Bivariate Distributions
Let's see the attributes that have a strong correlation with each other

Correlation Check

# Plot the heatmap for the correlation matrix of the numerical columns
plt.figure(figsize=(15, 7))
sns.heatmap(
    data.select_dtypes(include=["float64", "int64"]).corr(),
    annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral",
)
plt.show()

Attrition_Flag vs Gender

stacked_barplot(data, "Gender", "Attrition_Flag")

Attrition_Flag     0     1    All
Gender
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Marital_Status

stacked_barplot(data, "Attrition_Flag", "Marital_Status")

Marital_Status  Divorced  Married  Single   All
Attrition_Flag
All                  748     4687    3943  9378
0                    627     3978    3275  7880
1                    121      709     668  1498
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Education_Level

stacked_barplot(data, "Attrition_Flag", "Education_Level")

Education_Level  College  Doctorate  Graduate  High School  Post-Graduate  Uneducated   All
Attrition_Flag
All                 1013        451      3128         2013            516        1487  8608
0                    859        356      2641         1707            424        1250  7237
1                    154         95       487          306             92         237  1371
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Income_Category

stacked_barplot(data, "Attrition_Flag", "Income_Category")

Income_Category  $120K +  $40K - $60K  $60K - $80K  $80K - $120K  Less than $40K   abc    All
Attrition_Flag
All                  727         1790         1402          1535            3561  1112  10127
0                    601         1519         1213          1293            2949   925   8500
1                    126          271          189           242             612   187   1627
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Contacts_Count_12_mon

stacked_barplot(data, "Attrition_Flag", "Contacts_Count_12_mon")

Contacts_Count_12_mon    0     1     2     3     4    5   6    All
Attrition_Flag
1                        7   108   403   681   315   59  54   1627
All                    399  1499  3227  3380  1392  176  54  10127
0                      392  1391  2824  2699  1077  117   0   8500
------------------------------------------------------------------------------------------------------------------------
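
The raw counts are easier to read as attrition rates. The sketch below re-uses the counts printed in the table above (the "1" and "All" rows) and derives the share of attrited customers for each contact count; the frame and column names are illustrative and not part of the notebook.

```python
import pandas as pd

# Counts copied from the crosstab above (rows "1" and "All")
counts = pd.DataFrame(
    {
        "attrited": [7, 108, 403, 681, 315, 59, 54],
        "total": [399, 1499, 3227, 3380, 1392, 176, 54],
    },
    index=pd.Index(range(7), name="Contacts_Count_12_mon"),
)

# Share of attrited customers within each contact-count level, in percent
counts["attrition_rate_pct"] = (counts["attrited"] / counts["total"] * 100).round(1)
print(counts)
```

In these numbers the attrition rate rises with every additional contact, and all 54 customers with 6 contacts in the last 12 months attrited. On the real dataframe, `pd.crosstab(data["Contacts_Count_12_mon"], data["Attrition_Flag"], normalize="index")` should give the same shares directly.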
Let's see how the number of months a customer was inactive in the last 12 months
(Months_Inactive_12_mon) varies by the customer's account status (Attrition_Flag)
Attrition_Flag vs Months_Inactive_12_mon

stacked_barplot(data, "Attrition_Flag", "Months_Inactive_12_mon")

Months_Inactive_12_mon   0     1     2     3    4    5    6    All
Attrition_Flag
All                     29  2233  3282  3846  435  178  124  10127
1                       15   100   505   826  130   32   19   1627
0                       14  2133  2777  3020  305  146  105   8500
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Total_Relationship_Count

stacked_barplot(data, "Attrition_Flag", "Total_Relationship_Count")

Total_Relationship_Count    1     2     3     4     5     6    All
Attrition_Flag
All                       910  1243  2305  1912  1891  1866  10127
0                         677   897  1905  1687  1664  1670   8500
1                         233   346   400   225   227   196   1627
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Dependent_count

stacked_barplot(data, "Attrition_Flag", "Dependent_count")

Dependent_count    0     1     2     3     4    5    All
Attrition_Flag
All              904  1838  2655  2732  1574  424  10127
0                769  1569  2238  2250  1314  360   8500
1                135   269   417   482   260   64   1627
------------------------------------------------------------------------------------------------------------------------

Total_Revolving_Bal vs Attrition_Flag

distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")

Attrition_Flag vs Credit_Limit

# The function expects the predictor first and the target second
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")

Attrition_Flag vs Customer_Age

distribution_plot_wrt_target(data, "Customer_Age", "Attrition_Flag")

Total_Trans_Ct vs Attrition_Flag

distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")

Total_Trans_Amt vs Attrition_Flag

distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")

Let's see how the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1)
varies by the customer's account status (Attrition_Flag)

Total_Ct_Chng_Q4_Q1 vs Attrition_Flag

distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")

Avg_Utilization_Ratio vs Attrition_Flag

distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")

Attrition_Flag vs Months_on_book

distribution_plot_wrt_target(data, "Months_on_book", "Attrition_Flag")

Attrition_Flag vs Total_Revolving_Bal

distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")

Attrition_Flag vs Avg_Open_To_Buy

distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")

Data Preprocessing
Outlier Detection
# Selecting the numerical columns
numeric_data = data.select_dtypes(include=["float64", "int64"])

Q1 = numeric_data.quantile(0.25)  # To find the 25th percentile
Q3 = numeric_data.quantile(0.75)  # To find the 75th percentile

IQR = Q3 - Q1  # Inter Quartile Range (75th percentile - 25th percentile)

# Finding lower and upper bounds for all values. All values outside these bounds are outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# checking the % outliers
((numeric_data < lower) | (numeric_data > upper)).sum() / len(data) * 100

Attrition_Flag 16.066
Customer_Age 0.020
Dependent_count 0.000
Months_on_book 3.812
Total_Relationship_Count 0.000
Months_Inactive_12_mon 3.268
Contacts_Count_12_mon 6.211
Credit_Limit 9.717
Total_Revolving_Bal 0.000
Avg_Open_To_Buy 9.509
Total_Amt_Chng_Q4_Q1 3.910
Total_Trans_Amt 8.848
Total_Trans_Ct 0.020
Total_Ct_Chng_Q4_Q1 3.891
Avg_Utilization_Ratio 0.000
dtype: float64
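
The 1.5 * IQR rule used above can be traced by hand on a tiny example. The sketch below uses made-up values (the `toy` frame is illustrative, not taken from the dataset) and also includes a 0/1 column to show why a binary target such as Attrition_Flag gets "outliers": when most values are 0, Q1 = Q3 = 0, the IQR is 0, and every 1 falls outside the bounds. That would be consistent with the 16.066% reported above, which equals the attrited share 1627 / 10127.

```python
import pandas as pd

# Made-up values: one very large credit limit and one attrited customer
toy = pd.DataFrame(
    {
        "Credit_Limit": [1500, 2500, 3000, 3500, 4000, 4500, 5000, 34000],
        "Attrition_Flag": [0, 0, 0, 0, 0, 0, 0, 1],
    }
)

q1 = toy.quantile(0.25)  # 25th percentile per column
q3 = toy.quantile(0.75)  # 75th percentile per column
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Share of values outside [lower, upper] per column, in percent
outlier_pct = ((toy < lower) | (toy > upper)).sum() / len(toy) * 100
print(outlier_pct)
```

The percentage reported for the target column is therefore an artifact of the rule rather than a sign of unusual data.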

Train-Test Split
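# creating the copy of the dataframe
data1 = data.copy()

# NOTE: the anomalous Income_Category value in this data is "abc" (see the value counts above),
# so replacing "Unknown" matches nothing; the outputs below therefore still show 0 missing
# values and an "abc" level. Passing "abc" here would convert those entries to NaN.
data1["Income_Category"].replace("Unknown", np.nan, inplace=True)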

data1.isna().sum()

Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")

# Dividing the data into X and y
X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"]

# Splitting the data into train and temporary sets in the ratio 80:20
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

# Splitting the temporary set into test and validation sets in the ratio 75:25
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)

(8101, 19) (507, 19) (1519, 19)
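
The two-stage split above yields roughly 80% train, 5% validation and 15% test. The sketch below reproduces only the row counts on placeholder arrays of the same length (10,127 rows); `X_demo` and `y_demo` are stand-ins, not the real features or target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset
X_demo = np.arange(10127).reshape(-1, 1)
y_demo = np.zeros(10127, dtype=int)

# Stage 1: 80% train, 20% temporary
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stage 2: the temporary set is split 75:25 into test and validation
X_te, X_va, y_te, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)
print([round(s / len(X_demo), 3) for s in sizes])
```

With an imbalanced target like this one, passing `stratify=y` to `train_test_split` is a common way to keep the attrition share similar across the three sets; the notebook does not do this, so it is mentioned here only as an option.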

Missing value imputation


reqd_col_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]

# Fit and transform the train data
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])

# Transform the validation data
X_val[reqd_col_for_impute] = imputer.transform(X_val[reqd_col_for_impute])

# Transform the test data
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute])

# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())

Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 0
Marital_Status 0
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
------------------------------
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 0
Marital_Status 0
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
------------------------------
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 0
Marital_Status 0
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
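The imputer is fit on the training set only and then reused for the validation and test sets, so no information from the held-out data leaks into the fill values. A minimal pure-Python sketch of that idea follows; the helpers `fit_mode` and `apply_mode` and the toy columns are hypothetical, and `None` stands in for a missing value.

```python
from collections import Counter


def fit_mode(values):
    """Return the most frequent non-missing value (None marks a missing entry)."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0]


def apply_mode(values, fill_value):
    """Replace every missing entry with the value learned from the training data."""
    return [fill_value if v is None else v for v in values]


# Hypothetical toy columns.
train_marital = ["Married", "Single", None, "Married", "Divorced"]
val_marital = [None, "Single", None]

mode = fit_mode(train_marital)               # learned from the training data only
train_filled = apply_mode(train_marital, mode)
val_filled = apply_mode(val_marital, mode)   # validation reuses the training mode

print(mode, val_filled)
```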

cols = X_train.select_dtypes(include=["object", "category"])

for i in cols.columns:
    print(X_train[i].value_counts())
    print("*" * 30)

Gender
F 4279
M 3822
Name: count, dtype: int64
******************************
Education_Level
Graduate 3733
High School 1619
Uneducated 1171
College 816
Post-Graduate 407
Doctorate 355
Name: count, dtype: int64
******************************
Marital_Status
Married 4346
Single 3144
Divorced 611
Name: count, dtype: int64
******************************
Income_Category
Less than $40K 2812
$40K - $60K 1453
$80K - $120K 1237
$60K - $80K 1122
abc 889
$120K + 588
Name: count, dtype: int64
******************************
Card_Category
Blue 7557
Silver 436
Gold 93
Platinum 15
Name: count, dtype: int64
******************************

cols = X_val.select_dtypes(include=["object", "category"])

for i in cols.columns:
    print(X_val[i].value_counts())
    print("*" * 30)

Gender
F 266
M 241
Name: count, dtype: int64
******************************
Education_Level
Graduate 237
High School 94
Uneducated 84
College 49
Doctorate 24
Post-Graduate 19
Name: count, dtype: int64
******************************
Marital_Status
Married 272
Single 193
Divorced 42
Name: count, dtype: int64
******************************
Income_Category
Less than $40K 174
$40K - $60K 88
$60K - $80K 74
$80K - $120K 71
abc 62
$120K + 38
Name: count, dtype: int64
******************************
Card_Category
Blue 465
Silver 37
Gold 3
Platinum 2
Name: count, dtype: int64
******************************

cols = X_test.select_dtypes(include=["object", "category"])

for i in cols.columns:
    print(X_test[i].value_counts())  # the counts must be read from X_test, not X_train
    print("*" * 30)

Encoding categorical variables


X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)  ## Complete the code to encode the categorical variables in X_val
X_test = pd.get_dummies(X_test, drop_first=True)  ## Complete the code to encode the categorical variables in X_test
print(X_train.shape, X_val.shape, X_test.shape)

(8101, 30) (507, 30) (1519, 30)

• After encoding there are 30 columns.
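Encoding each set separately yields matching columns here because every category appears in all three sets. If a rare category (for example a card type) were absent from a smaller set, the column counts would differ. A minimal sketch of one way to guard against that is to learn the categories from the training data and reuse them; the helpers `learn_categories` and `one_hot` and the toy data are hypothetical.

```python
def learn_categories(train_values):
    """Sorted categories seen in training; the first one is dropped as the baseline."""
    return sorted(set(train_values))[1:]


def one_hot(values, kept_categories, prefix):
    """Encode values as rows of {column_name: bool} using the training categories."""
    return [
        {f"{prefix}_{cat}": (v == cat) for cat in kept_categories}
        for v in values
    ]


# Hypothetical toy data: the validation column contains no "Platinum" rows.
train_cards = ["Blue", "Silver", "Gold", "Platinum", "Blue"]
val_cards = ["Blue", "Silver"]

kept = learn_categories(train_cards)
train_rows = one_hot(train_cards, kept, "Card_Category")
val_rows = one_hot(val_cards, kept, "Card_Category")

# Both sets end up with identical columns, even for categories unseen in validation.
print(list(val_rows[0].keys()))
```

With pandas, a comparable effect is commonly obtained by reindexing the encoded validation and test frames to the training columns, for example `X_val.reindex(columns=X_train.columns, fill_value=False)`.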


# check the top 5 rows from the train dataset
X_train.head()

      Customer_Age  Dependent_count  Months_on_book  Total_Relationship_Count  \
9066            54                1              36                         1
5814            58                4              48                         1
792             45                4              36                         6
1791            34                2              36                         4
5011            49                2              39                         5

Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit \


9066 3 3 3723.000
5814 4 3 5396.000
792 1 3 15987.000
1791 3 4 3625.000
5011 3 4 2720.000

Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 \


9066 1728 1995.000 0.595
5814 1803 3593.000 0.493
792 1648 14339.000 0.732
1791 2517 1108.000 1.158
5011 1926 794.000 0.602

Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 \


9066 8554 99 0.678
5814 2107 39 0.393
792 1436 36 1.250
1791 2616 46 1.300
5011 3806 61 0.794

Avg_Utilization_Ratio Gender_M Education_Level_Doctorate \


9066 0.464 False False
5814 0.334 False False
792 0.103 False False
1791 0.694 False False
5011 0.708 False False

Education_Level_Graduate Education_Level_High School \


9066 True False
5814 False True
792 True False
1791 True False
5011 False True
Education_Level_Post-Graduate Education_Level_Uneducated \
9066 False False
5814 False False
792 False False
1791 False False
5011 False False

Marital_Status_Married Marital_Status_Single \
9066 False True
5814 True False
792 False True
1791 False True
5011 True False

Income_Category_$40K - $60K Income_Category_$60K - $80K \


9066 False False
5814 False False
792 False False
1791 False False
5011 True False

Income_Category_$80K - $120K Income_Category_Less than $40K \


9066 False False
5814 False False
792 False True
1791 False True
5011 False False

      Income_Category_abc  Card_Category_Gold  Card_Category_Platinum  \
9066                 True               False                   False
5814                 True               False                   False
792                 False                True                   False
1791                False               False                   False
5011                False               False                   False

Card_Category_Silver
9066 False
5814 False
792 False
1791 False
5011 False
Model Building
Model evaluation criterion
The model can make wrong predictions in two ways:

• Predicting that a customer will attrite when the customer does not attrite (false positive)

• Predicting that a customer will not attrite when the customer does attrite (false negative)

Which case is more important?

• Predicting that a customer will not attrite when the customer does attrite, because the bank
loses a valuable customer or asset.

How can this loss be reduced, i.e. how can false negatives be reduced?

• The bank would want Recall to be maximized: the greater the Recall, the fewer the false
negatives. Hence, the focus should be on increasing Recall, in other words on correctly
identifying the true positives (i.e. Class 1), so that the bank can retain its valuable
customers by identifying the customers who are at risk of attrition.

Let's define a function to output different metrics (including recall) on the train, validation
and test sets, and a function to show the confusion matrix, so that we do not have to repeat
the same code while evaluating models.
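All four metrics used in this notebook derive from the confusion-matrix counts, which makes the link between Recall and false negatives explicit. A minimal sketch, with hypothetical counts chosen only for illustration (`classification_metrics` is not a library function):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of actual attriters that were caught
    precision = tp / (tp + fp)     # share of predicted attriters that were right
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


# Hypothetical counts: 74 actual attriters, 8 of them missed (false negatives).
scores = classification_metrics(tp=66, fp=9, fn=8, tn=424)
print({name: round(value, 3) for name, value in scores.items()})
```

Every false negative removed moves a customer from `fn` to `tp`, which raises Recall without changing its denominator.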

# defining a function to compute different metrics to check
# performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification
    model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf

def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building - Original Data


models = []  # Empty list to store all the models

# Appending models into the list

models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
'_______'  ## Complete the code to append remaining 3 models in the list models

print("\n" "Training Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))

Training Performance:

Bagging: 0.9784615384615385
Random forest: 1.0

Validation Performance:

Bagging: 0.8513513513513513
Random forest: 0.7432432432432432

Model Building - Oversampled Data


print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))

print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))

Before Oversampling, counts of label 'Yes': 1300


Before Oversampling, counts of label 'No': 6801

After Oversampling, counts of label 'Yes': 6801


After Oversampling, counts of label 'No': 6801

After Oversampling, the shape of train_X: (13602, 30)


After Oversampling, the shape of train_y: (13602,)
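SMOTE balances the classes by adding synthetic minority rows rather than duplicating existing ones. The core interpolation step can be sketched as follows; this is a simplified illustration and not the imbalanced-learn implementation, which also handles the nearest-neighbour search. The helper `synthesize` and the toy points are hypothetical.

```python
import random


def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the segment between a sample and a neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(1)

# Hypothetical minority-class points with two numeric features each.
sample = [2.0, 10.0]
neighbor = [4.0, 20.0]
new_point = synthesize(sample, neighbor, rng)

n_needed = 6801 - 1300   # synthetic rows required to balance the classes above
print(new_point, n_needed)
```

Because the new points are interpolated, one-hot encoded columns can take fractional values after oversampling; imbalanced-learn also provides `SMOTENC` for data with categorical features.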

models = []  # Empty list to store all the models

# Appending models into the list

models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
'_______'  ## Complete the code to append remaining 3 models in the list models

print("\n" "Training Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))  ## Complete the code to build models on oversampled data
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

Training Performance:

Bagging: 0.9979414791942361
Random forest: 1.0

Validation Performance:

Bagging: 0.8918918918918919
Random forest: 0.8378378378378378

Model Building - Undersampled Data


rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 1300
Before Under Sampling, counts of label 'No': 6801

After Under Sampling, counts of label 'Yes': 1300


After Under Sampling, counts of label 'No': 1300

After Under Sampling, the shape of train_X: (2600, 30)


After Under Sampling, the shape of train_y: (2600,)
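Random undersampling keeps every minority row and discards majority rows at random until both classes have the same size. A minimal sketch of that idea, using labels with the same class counts as above (the helper `random_undersample` is hypothetical and not the imbalanced-learn implementation):

```python
import random


def random_undersample(labels, rng):
    """Return sorted row indices keeping all minority rows and an equal-sized
    random subset of majority rows (labels are 0/1, with 1 as the minority)."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)


# Hypothetical labels with the same class counts as the training set above.
labels = [1] * 1300 + [0] * 6801
kept = random_undersample(labels, random.Random(1))

print(len(kept), sum(labels[i] for i in kept))   # 2600 1300
```

The price of the balance is that 5501 of the 6801 majority rows are never seen by the model.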

models = []  # Empty list to store all the models

# Appending models into the list

models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
'_______'  ## Complete the code to append remaining 3 models in the list models

print("\n" "Training Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))  ## Complete the code to build models on undersampled data
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

Training Performance:

Bagging: 0.9930769230769231
Random forest: 1.0

Validation Performance:

Bagging: 0.9054054054054054
Random forest: 0.918918918918919

Hyperparameter Tuning
Note
1. Sample parameter grids have been provided for the necessary hyperparameter tuning.
These sample grids are expected to provide a balance between model performance
improvement and execution time. One can extend or reduce the parameter grid based on
execution time and system configuration.
• Please note that if the parameter grid is extended to improve the model performance
further, the execution time will increase.
2. The models chosen in this notebook are based on test runs. One can update the best
models as obtained upon code execution and tune them for best performance.

Tuning AdaBoost using original data


%%time

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV

param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations

scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_jobs=-1, n_iter=50,
    scoring=scorer, cv=5, random_state=1,
)

# Fitting parameters in RandomizedSearchCV

randomized_cv.fit(X_train, y_train)  ## Complete the code to fit the model on original data

print("Best parameters are {} with CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))

Best parameters are {'n_estimators': 100, 'learning_rate': 0.1,
'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
with CV score=0.8699999999999999:
CPU times: total: 4.52 s
Wall time: 1min 24s
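The AdaBoost grid above contains fewer combinations than the requested `n_iter=50`. When every parameter is given as a list, scikit-learn is expected to fall back to evaluating each combination once (with a warning) rather than repeating candidates. The count can be sketched as below; the key `base_estimator_max_depth` is a stand-in for the two tree objects and is not a real parameter name.

```python
import itertools
import random

# The AdaBoost grid above, with tree depths standing in for the tree objects.
param_grid = {
    "n_estimators": [50, 75, 100],           # np.arange(50, 110, 25)
    "learning_rate": [0.01, 0.1, 0.05],
    "base_estimator_max_depth": [2, 3],
}

names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]
print(len(combinations))   # 18

# Sampling without replacement can never yield more candidates than exist.
n_iter = 50
n_candidates = min(n_iter, len(combinations))
candidates = random.Random(1).sample(combinations, n_candidates)
print(n_candidates, n_candidates * 5)   # 18 candidates, 90 fits with cv=5
```

By the same count, the Gradient Boosting and XGBoost grids used later in this notebook each hold 108 combinations, so `n_iter=50` samples a subset of them.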

# Building the model with the best parameters

tuned_adb = AdaBoostClassifier(
    random_state=1,
    n_estimators=randomized_cv.best_params_['n_estimators'],
    learning_rate=randomized_cv.best_params_['learning_rate'],
    base_estimator=DecisionTreeClassifier(
        max_depth=randomized_cv.best_params_['base_estimator'].max_depth, random_state=1
    ),
)

# Fit the model on the original data

tuned_adb.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,

random_state=1),
learning_rate=0.1, n_estimators=100,
random_state=1)

# Checking model's performance on training set

adb_train = model_performance_classification_sklearn(tuned_adb, X_train, y_train)
adb_train

Accuracy Recall Precision F1


0 0.985 0.934 0.969 0.951

# Checking model's performance on validation set


adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val

Accuracy Recall Precision F1


0 0.966 0.892 0.880 0.886

Tuning AdaBoost using undersampled data

# Building the model with the best parameters found by the search on the original data
tuned_ada2 = AdaBoostClassifier(
    random_state=1,
    n_estimators=randomized_cv.best_params_['n_estimators'],
    learning_rate=randomized_cv.best_params_['learning_rate'],
    base_estimator=DecisionTreeClassifier(
        max_depth=randomized_cv.best_params_['base_estimator'].max_depth, random_state=1
    ),
)

# Fit the model on undersampled data

tuned_ada2.fit(X_train_un, y_train_un)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,

random_state=1),
learning_rate=0.1, n_estimators=100,
random_state=1)

# Checking model's performance on undersampled training set

adb2_train = model_performance_classification_sklearn(tuned_ada2, X_train_un, y_train_un)
adb2_train

Accuracy Recall Precision F1


0 0.992 0.993 0.991 0.992

# Checking model's performance on validation set


adb2_val = model_performance_classification_sklearn(tuned_ada2, X_val, y_val)
adb2_val

Accuracy Recall Precision F1


0 0.937 0.946 0.714 0.814

Tuning Gradient Boosting using undersampled data


%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV

param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
    "max_features": [0.5, 0.7, 1],
}

# Type of scoring used to compare parameter combinations

scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=50,
    scoring=scorer, cv=5, random_state=1, n_jobs=-1,
)

# Fitting parameters in RandomizedSearchCV

randomized_cv.fit(X_train_un, y_train_un)  ## Complete the code to fit the model on undersampled data

print("Best parameters are {} with CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))

Best parameters are {'subsample': 0.9, 'n_estimators': 100,
'max_features': 0.5, 'learning_rate': 0.1, 'init':
AdaBoostClassifier(random_state=1)} with CV score=0.9546153846153846:
CPU times: total: 2.42 s
Wall time: 58.7 s

# Building the model with the best parameters

tuned_gbm1 = GradientBoostingClassifier(
    max_features=randomized_cv.best_params_['max_features'],
    init=AdaBoostClassifier(random_state=1),
    random_state=1,
    learning_rate=randomized_cv.best_params_['learning_rate'],
    n_estimators=randomized_cv.best_params_['n_estimators'],
    subsample=randomized_cv.best_params_['subsample'],
)  ## Complete the code with the best parameters obtained from tuning

# Fit the model on undersampled data

tuned_gbm1.fit(X_train_un, y_train_un)

GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, random_state=1,
subsample=0.9)

# Checking model's performance on undersampled training set

gbm1_train = model_performance_classification_sklearn(tuned_gbm1, X_train_un, y_train_un)
gbm1_train

Accuracy Recall Precision F1


0 0.975 0.978 0.972 0.975

# Checking model's performance on validation set

gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm1_val

Accuracy Recall Precision F1


0 0.937 0.946 0.714 0.814

Tuning Gradient Boosting using original data


%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV

param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
    "max_features": [0.5, 0.7, 1],
}

# Type of scoring used to compare parameter combinations

scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=50,
    scoring=scorer, cv=5, random_state=1, n_jobs=-1,
)

# Fitting parameters in RandomizedSearchCV

randomized_cv.fit(X_train, y_train)  ## Complete the code to fit the model on original data

print("Best parameters are {} with CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))

Best parameters are {'subsample': 0.9, 'n_estimators': 100,
'max_features': 0.5, 'learning_rate': 0.1, 'init':
AdaBoostClassifier(random_state=1)} with CV score=0.8376923076923077:
CPU times: total: 4.42 s
Wall time: 1min 57s

# Building the model with the best parameters

tuned_gbm2 = GradientBoostingClassifier(
    max_features=randomized_cv.best_params_['max_features'],
    init=randomized_cv.best_params_['init'],
    random_state=1,
    learning_rate=randomized_cv.best_params_['learning_rate'],
    n_estimators=randomized_cv.best_params_['n_estimators'],
    subsample=randomized_cv.best_params_['subsample'],
)

# Fit the model on the original data

tuned_gbm2.fit(X_train, y_train)

GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, random_state=1,
subsample=0.9)

Performance of Gradient Boosting tuned on original data

# tuned_gbm2 was fit on the original data; here its performance is checked on the oversampled train set
gbm2_train = model_performance_classification_sklearn(tuned_gbm2, X_train_over, y_train_over)
gbm2_train

Accuracy Recall Precision F1
0 0.926 0.859 0.992 0.921

# Checking model's performance on validation set

gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_val, y_val)
gbm2_val

Accuracy Recall Precision F1


0 0.966 0.851 0.913 0.881

Tuning XGBoost Model with Original data


Note: This section is optional. You can choose not to build XGBoost if you are facing issues with
installation or if it is taking more time to execute.

%%time

# defining model
Model = XGBClassifier(random_state=1, eval_metric='logloss')

# Parameter grid to pass in RandomizedSearchCV

param_grid = {
    'n_estimators': np.arange(50, 110, 25),
    'scale_pos_weight': [1, 2, 5],
    'learning_rate': [0.01, 0.1, 0.05],
    'gamma': [1, 3],
    'subsample': [0.7, 0.9],
}
from sklearn import metrics

# Type of scoring used to compare parameter combinations

scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs=-1,
    scoring=scorer, cv=5, random_state=1,
)

# Fitting parameters in RandomizedSearchCV

randomized_cv.fit(X_train, y_train)  ## Complete the code to fit the model on original data

print("Best parameters are {} with CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))

Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5,
'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3} with CV
score=0.9346153846153846:
CPU times: total: 3.28 s
Wall time: 41.5 s
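The search settles on `scale_pos_weight=5`, the largest value in the grid. A commonly suggested starting point for this parameter is the ratio of negative to positive training examples, and the class counts printed earlier give a similar figure. A small sketch of that arithmetic (the variable names are illustrative):

```python
n_negative = 6801   # existing customers in the training set (label 0)
n_positive = 1300   # attrited customers in the training set (label 1)

# Heuristic starting point: ratio of negative to positive examples.
ratio = n_negative / n_positive
print(round(ratio, 2))   # 5.23

# The candidate in the grid closest to that ratio.
grid = [1, 2, 5]
closest = min(grid, key=lambda w: abs(w - ratio))
print(closest)           # 5
```

Weighting the positive class this way lets the model address the imbalance on the original data without resampling.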

# Building the model with the best parameters

tuned_xgb = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=randomized_cv.best_params_['subsample'],
    scale_pos_weight=randomized_cv.best_params_['scale_pos_weight'],
    n_estimators=randomized_cv.best_params_['n_estimators'],
    learning_rate=randomized_cv.best_params_['learning_rate'],
    gamma=randomized_cv.best_params_['gamma'],
)

# Fit the model on the original data

tuned_xgb.fit(X_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,


colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None,
early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None,
max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None,
max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=75,
n_jobs=None, num_parallel_tree=None,
random_state=1, ...)

# Checking model's performance on original training set

xgb_train = model_performance_classification_sklearn(tuned_xgb, X_train, y_train)
xgb_train

Accuracy Recall Precision F1


0 0.976 0.992 0.874 0.929

# Checking model's performance on validation set

xgb_val = model_performance_classification_sklearn(tuned_xgb, X_val, y_val)
xgb_val

Accuracy Recall Precision F1


0 0.951 0.946 0.769 0.848

Model Comparison and Final Model Selection


Note: If you want to include XGBoost model for final model selection, you need to add
xgb_train.T in the training performance comparison list and xgb_val.T in the validation
performance comparison list below.

# Training performance comparison

models_train_comp_df = pd.concat(
    [
        gbm1_train.T,
        gbm2_train.T,
        adb2_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Original data",
    "AdaBoost trained with Undersampled data",
]
print("Training performance comparison:")
models_train_comp_df

Training performance comparison:

Gradient boosting trained with Undersampled data \


Accuracy 0.975
Recall 0.978
Precision 0.972
F1 0.975

Gradient boosting trained with Original data \


Accuracy 0.926
Recall 0.859
Precision 0.992
F1 0.921

AdaBoost trained with Undersampled data


Accuracy 0.992
Recall 0.993
Precision 0.991
F1 0.992

# Validation performance comparison

models_val_comp_df = pd.concat(
    [
        gbm1_val.T,
        gbm2_val.T,
        adb2_val.T,
        xgb_val.T,  # Adding XGBoost validation performance
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Original data",
    "AdaBoost trained with Undersampled data",
    "XGBoost trained with Original data",  # Adding XGBoost column
]
print("Validation performance comparison:")
models_val_comp_df

Validation performance comparison:

Gradient boosting trained with Undersampled data \


Accuracy 0.937
Recall 0.946
Precision 0.714
F1 0.814

Gradient boosting trained with Original data \


Accuracy 0.966
Recall 0.851
Precision 0.913
F1 0.881

AdaBoost trained with Undersampled data \


Accuracy 0.937
Recall 0.946
Precision 0.714
F1 0.814

XGBoost trained with Original data


Accuracy 0.951
Recall 0.946
Precision 0.769
F1 0.848
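Since Recall is the primary criterion, one way to read the validation comparison is to rank the models by Recall and use Precision to break ties. The sketch below applies that rule to the figures reported above; the tie-breaking rule itself is an assumption and is not stated in the notebook.

```python
# Validation metrics copied from the comparison above.
validation_scores = {
    "Gradient boosting trained with Undersampled data": {"Recall": 0.946, "Precision": 0.714},
    "Gradient boosting trained with Original data": {"Recall": 0.851, "Precision": 0.913},
    "AdaBoost trained with Undersampled data": {"Recall": 0.946, "Precision": 0.714},
    "XGBoost trained with Original data": {"Recall": 0.946, "Precision": 0.769},
}

# Rank by recall first, then by precision to break ties.
best_model = max(
    validation_scores,
    key=lambda name: (
        validation_scores[name]["Recall"],
        validation_scores[name]["Precision"],
    ),
)
print(best_model)   # XGBoost trained with Original data
```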

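Three models tie on validation Recall (0.946); among them, XGBoost trained with original data
has the highest Precision (0.769) and F1 (0.848), so it is taken as the final model. Let's find
out how the final model performs on unseen test data.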

# Let's check the performance of the best model on the test set

test_performance = model_performance_classification_sklearn(tuned_xgb, X_test, y_test)
print("Test performance:")
test_performance

Test performance:

Accuracy Recall Precision F1


0 0.955 0.937 0.817 0.873

Feature Importances

feature_names = X_train.columns
importances = tuned_xgb.feature_importances_  ## Complete the code to check the feature importance of the best model
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Business Insights and Conclusions
