AML Project - Learner Notebook (Low Code)
Business Context
Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit
cards are a good source of income for banks because of the various fees they charge, such as
annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign
transaction fees. Some fees are charged to every user irrespective of usage, while others are
charged only under specific circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants
to analyze its customer data, identify the customers who are likely to leave their credit card
services, and understand the reasons why - so that the bank can improve in those areas.
As a Data Scientist at Thera Bank, you need to build a classification model that will help
the bank improve its services so that customers do not give up their credit cards.
Data Description
• CLIENTNUM: Client number. Unique identifier for the customer holding the account
• Attrition_Flag: Internal event (customer activity) variable - if the account is closed then
"Attrited Customer" else "Existing Customer"
• Customer_Age: Age in Years
• Gender: Gender of the account holder
• Dependent_count: Number of dependents
• Education_Level: Educational qualification of the account holder - Graduate, High
School, Unknown, Uneducated, College (refers to a college student), Post-Graduate,
Doctorate
• Marital_Status: Marital Status of the account holder
• Income_Category: Annual Income Category of the account holder
• Card_Category: Type of Card
• Months_on_book: Period of relationship with the bank (in months)
• Total_Relationship_Count: Total no. of products held by the customer
• Months_Inactive_12_mon: No. of months inactive in the last 12 months
• Contacts_Count_12_mon: No. of Contacts in the last 12 months
• Credit_Limit: Credit Limit on the Credit Card
• Total_Revolving_Bal: Total Revolving Balance on the Credit Card
• Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
• Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
• Total_Trans_Amt: Total Transaction Amount (Last 12 months)
• Total_Trans_Ct: Total Transaction Count (Last 12 months)
• Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
• Avg_Utilization_Ratio: Average Card Utilization Ratio
What Is a Revolving Balance?
• If we don't pay the balance of a revolving credit account in full every month, the unpaid
portion carries over to the next month; that carried-over amount is called a revolving balance.
• For example, if a customer charges $1,000 in a month and pays only $400 of it, the remaining
$600 (plus any interest) carries over as the revolving balance.
• Blanks '_______' are provided in the notebook and need to be filled with appropriate
code to get the correct result. Every '_______' blank is accompanied by a comment that
briefly describes what needs to be filled in.
• Identify the task to be performed, and only then write the required code.
• Fill in the code wherever indicated by comments like "# write your code here" or "#
complete the code". Running incomplete code may throw an error.
• Please run the code cells sequentially from the beginning to avoid unnecessary errors.
• Add the results/observations (wherever mentioned) derived from the analysis to the
presentation and submit it.
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
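Only two import statements survive in this extract. As a rough sketch (the notebook's actual import cell is not shown here), the libraries used later in this notebook would be brought in along these lines:

# Assumed consolidated imports for the rest of the notebook (not part of the extract)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, f1_score,
    confusion_matrix, make_scorer,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier, RandomForestClassifier,
    AdaBoostClassifier, GradientBoostingClassifier,
)
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler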
Data Overview
The initial steps to get an overview of any dataset are to:
• observe the first few rows of the dataset, to check whether the dataset has been loaded
properly or not
• get information about the number of rows and columns in the dataset
• find out the data types of the columns to ensure that data is stored in the preferred
format and the value of each property is as expected.
• check the statistical summary of the dataset to get an overview of the numerical columns
of the data
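The overview outputs that follow were produced by the usual pandas calls; a minimal sketch (assuming the data has been loaded into a DataFrame named data) is:

data.shape          # number of rows and columns
data.head()         # first few rows
data.tail()         # last few rows
data.info()         # column data types and non-null counts
data.describe().T   # statistical summary of the numerical columns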
(10127, 21)
data.head() (last displayed column only; the wide dataframe output is truncated in this extract):
   Avg_Utilization_Ratio
0                  0.061
1                  0.105
2                  0.000
3                  0.760
4                  0.000

data.tail() (selected columns only):
       Months_on_book  Total_Relationship_Count  Months_Inactive_12_mon  Total_Ct_Chng_Q4_Q1  Avg_Utilization_Ratio
10122              40                         3                       2                0.857                  0.462
10123              25                         4                       2                0.683                  0.511
10124              36                         5                       3                0.818                  0.000
10125              36                         4                       3                0.722                  0.000
10126              25                         6                       2                0.649                  0.189
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CLIENTNUM 10127 non-null int64
1 Attrition_Flag 10127 non-null object
2 Customer_Age 10127 non-null int64
3 Gender 10127 non-null object
4 Dependent_count 10127 non-null int64
5 Education_Level 8608 non-null object
6 Marital_Status 9378 non-null object
7 Income_Category 10127 non-null object
8 Card_Category 10127 non-null object
9 Months_on_book 10127 non-null int64
10 Total_Relationship_Count 10127 non-null int64
11 Months_Inactive_12_mon 10127 non-null int64
12 Contacts_Count_12_mon 10127 non-null int64
13 Credit_Limit 10127 non-null float64
14 Total_Revolving_Bal 10127 non-null int64
15 Avg_Open_To_Buy 10127 non-null float64
16 Total_Amt_Chng_Q4_Q1 10127 non-null float64
17 Total_Trans_Amt 10127 non-null int64
18 Total_Trans_Ct 10127 non-null int64
19 Total_Ct_Chng_Q4_Q1 10127 non-null float64
20 Avg_Utilization_Ratio 10127 non-null float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
• There are a total of 21 columns and 10,000+ observations in the dataset.
• Two columns, Education_Level and Marital_Status, have missing values - around 1,500 and
750 respectively; all other columns are complete.
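The two summaries below - missing values per column and unique values per column - would come from calls such as:

data.isnull().sum()   # number of missing values in each column
data.nunique()        # number of unique values in each column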
CLIENTNUM 0
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
CLIENTNUM 10127
Attrition_Flag 2
Customer_Age 45
Gender 2
Dependent_count 6
Education_Level 6
Marital_Status 3
Income_Category 6
Card_Category 4
Months_on_book 44
Total_Relationship_Count 6
Months_Inactive_12_mon 7
Contacts_Count_12_mon 7
Credit_Limit 6205
Total_Revolving_Bal 1974
Avg_Open_To_Buy 6813
Total_Amt_Chng_Q4_Q1 1158
Total_Trans_Amt 5033
Total_Trans_Ct 126
Total_Ct_Chng_Q4_Q1 830
Avg_Utilization_Ratio 964
dtype: int64
• Customer_Age takes only 45 distinct values, i.e., customer ages span a fairly narrow range
• We have many continuous variables - Customer_Age, Credit_Limit, and
Total_Revolving_Bal, for example
• Most of the remaining variables are categorical or discrete counts
data.describe(include=["object"]).T
for i in data.describe(include=["object"]).columns:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
# CLIENTNUM consists of unique IDs for clients and hence will not add value to the modeling
data.drop(["CLIENTNUM"], axis=1, inplace=True)
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with the count or percentage labeled at the top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-position of the bar centre
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()


def distribution_plot_wrt_target(data, predictor, target):
    """Plot the distribution of a predictor for each level of the target."""
    target_uniq = data[target].unique()
    # (per-level distribution plotting code not shown in this extract)
    plt.tight_layout()
    plt.show()
Univariate analysis
Customer_Age
Months_on_book
histogram_boxplot(data, "Months_on_book")  ## Complete the code to create histogram_boxplot for 'Months_on_book'
Credit_Limit
Total_Trans_Ct
Total_Ct_Chng_Q4_Q1
labeled_barplot(data, "Dependent_count")
Total_Relationship_Count
Correlation Check
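The correlation heatmap itself is a plot and does not survive in this text extract; a sketch of code that would produce it (computed on the numeric columns only) is:

# heatmap of pairwise correlations between the numeric columns (sketch, not from the extract)
plt.figure(figsize=(15, 7))
sns.heatmap(
    data.select_dtypes(include=np.number).corr(),
    annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral",
)
plt.show()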
Attrition_Flag     0     1    All
Gender
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Contacts_Count_12_mon
stacked_barplot(data, "Attrition_Flag", "Contacts_Count_12_mon")  ## Complete the code to create stacked_barplot for Attrition_Flag vs Contacts_Count_12_mon
Contacts_Count_12_mon    0     1     2     3     4    5   6    All
Attrition_Flag
1                        7   108   403   681   315   59  54   1627
All                    399  1499  3227  3380  1392  176  54  10127
0                      392  1391  2824  2699  1077  117   0   8500
------------------------------------------------------------------------------------------------------------------------
Let's see how the number of months a customer was inactive in the last 12 months
(Months_Inactive_12_mon) varies with the customer's account status (Attrition_Flag)
Attrition_Flag vs Months_Inactive_12_mon
stacked_barplot(data,"Attrition_Flag", "Months_Inactive_12_mon") ##
Complete the code to create distribution_plot for Attrition_Flag vs
Months_Inactive_12_mon
Months_Inactive_12_mon    0     1     2     3    4    5    6    All
Attrition_Flag
All                      29  2233  3282  3846  435  178  124  10127
1                        15   100   505   826  130   32   19   1627
0                        14  2133  2777  3020  305  146  105   8500
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Total_Relationship_Count
stacked_barplot(data,"Attrition_Flag", "Total_Relationship_Count") ##
Complete the code to create distribution_plot for Attrition_Flag vs
Total_Relationship_Count
Total_Relationship_Count    1     2     3     4     5     6    All
Attrition_Flag
All                       910  1243  2305  1912  1891  1866  10127
0                         677   897  1905  1687  1664  1670   8500
1                         233   346   400   225   227   196   1627
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Dependent_count
Dependent_count    0     1     2     3     4    5    All
Attrition_Flag
All              904  1838  2655  2732  1574  424  10127
0                769  1569  2238  2250  1314  360   8500
1                135   269   417   482   260   64   1627
------------------------------------------------------------------------------------------------------------------------
Total_Revolving_Bal vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Revolving_Bal",
"Attrition_Flag")
Total_Trans_Amt vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")  ## Complete the code to create distribution_plot for Total_Trans_Amt vs Attrition_Flag
Let's see how the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1)
varies with the customer's account status (Attrition_Flag)
Total_Ct_Chng_Q4_Q1 vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1",
"Attrition_Flag") ## Complete the code to create distribution_plot for
Total_Ct_Chng_Q4_Q1 vs Attrition_Flag
Avg_Utilization_Ratio vs Attrition_Flag
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio",
"Attrition_Flag") ## Complete the code to create distribution_plot for
Avg_Utilization_Ratio vs Attrition_Flag
Attrition_Flag vs Months_on_book
distribution_plot_wrt_target(data, "Attrition_Flag",
"Total_Revolving_Bal") ## Complete the code to create
distribution_plot for Attrition_Flag vs Total_Revolving_Bal
Attrition_Flag vs Avg_Open_To_Buy
distribution_plot_wrt_target(data, "Attrition_Flag",
"Avg_Open_To_Buy") ## Complete the code to create distribution_plot
for Attrition_Flag vs Avg_Open_To_Buy
Data Preprocessing
Outlier Detection
Q1 = numeric_data.quantile(0.25)  # To find the 25th percentile
Q3 = numeric_data.quantile(0.75)  # To find the 75th percentile
IQR = Q3 - Q1  # Inter-quartile range
# Finding lower and upper bounds for all values; all values outside these bounds are outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
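With these bounds in hand, the share of outliers in each numeric column can be checked; a minimal sketch (numeric_data is the numeric subset of the dataframe, as above):

# percentage of observations per column falling outside the IQR-based bounds
outlier_pct = (
    ((numeric_data < lower) | (numeric_data > upper)).sum() / len(numeric_data) * 100
)
print(outlier_pct.round(2))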
Train-Test Split
# creating the copy of the dataframe
data1 = data.copy()
data1.isna().sum()
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")
X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"]
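The split, imputation, and encoding cells themselves are not shown in this extract, only their outputs. A rough sketch of those steps, under assumed split proportions (the actual sizes used in the notebook may differ) and assuming Attrition_Flag has already been encoded as 0/1 (1 = attrited), as the earlier crosstabs suggest:

# split into a temporary train set and a test set, then carve a validation set out of train
# (the 60/20/20 proportions here are an assumption)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

# impute missing Education_Level / Marital_Status with the most frequent value,
# fitting the imputer on the training set only
cols_to_impute = ["Education_Level", "Marital_Status"]
X_train[cols_to_impute] = imputer.fit_transform(X_train[cols_to_impute])
X_val[cols_to_impute] = imputer.transform(X_val[cols_to_impute])
X_test[cols_to_impute] = imputer.transform(X_test[cols_to_impute])

# one-hot encode the categorical columns (drop_first avoids redundant dummy columns)
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)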
Missing-value check after imputation: every column shows 0 missing values in each of the three sets (train, validation, and test).
Gender
F 4279
M 3822
Name: count, dtype: int64
******************************
Education_Level
Graduate 3733
High School 1619
Uneducated 1171
College 816
Post-Graduate 407
Doctorate 355
Name: count, dtype: int64
******************************
Marital_Status
Married 4346
Single 3144
Divorced 611
Name: count, dtype: int64
******************************
Income_Category
Less than $40K 2812
$40K - $60K 1453
$80K - $120K 1237
$60K - $80K 1122
abc 889
$120K + 588
Name: count, dtype: int64
******************************
Card_Category
Blue 7557
Silver 436
Gold 93
Platinum 15
Name: count, dtype: int64
******************************
Gender
F 266
M 241
Name: count, dtype: int64
******************************
Education_Level
Graduate 237
High School 94
Uneducated 84
College 49
Doctorate 24
Post-Graduate 19
Name: count, dtype: int64
******************************
Marital_Status
Married 272
Single 193
Divorced 42
Name: count, dtype: int64
******************************
Income_Category
Less than $40K 174
$40K - $60K 88
$60K - $80K 74
$80K - $120K 71
abc 62
$120K + 38
Name: count, dtype: int64
******************************
Card_Category
Blue 465
Silver 37
Gold 3
Platinum 2
Name: count, dtype: int64
******************************
[Truncated preview of the one-hot encoded training data: dummy columns such as Marital_Status_Married, Marital_Status_Single, ..., Card_Category_Silver hold boolean (True/False) values for a sample of rows.]
Model Building
Model evaluation criterion
The model can make wrong predictions in two ways:
• Predicting that a customer will not attrite when they actually do, i.e., losing a valuable
customer (a false negative).
• Predicting that a customer will attrite when they actually do not, i.e., spending retention
effort on a customer who was going to stay (a false positive).
Losing a customer who attrites is the costlier error, so the bank would want Recall to be
maximized: the greater the Recall, the fewer the false negatives. Hence, the focus should be on
increasing Recall, i.e., correctly identifying the true positives (Class 1), so that the bank can
retain its valuable customers by flagging those at risk of attrition.
Let's define a function to output different metrics (including recall) on the train and test
sets, and a function to show the confusion matrix, so that we do not have to repeat the same
code every time we evaluate a model.
def model_performance_classification_sklearn(model, predictors, target):
    """
    Compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    pred = model.predict(predictors)  # predict using the independent variables
    # metrics reported: accuracy, recall, precision, and F1 (target assumed encoded as 0/1)
    df_perf = pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )
    return df_perf


def confusion_matrix_sklearn(model, predictors, target):
    """
    Plot the confusion matrix with counts and percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Training Performance:
Bagging: 0.9784615384615385
Random forest: 1.0
Validation Performance:
Bagging: 0.8513513513513513
Random forest: 0.7432432432432432
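The recall figures above come from baseline models whose code is not shown in this extract; a minimal sketch of how such figures could be produced (variable names such as X_val and y_val are assumptions) is:

# fit two baseline models and report recall on the training and validation sets
models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random forest": RandomForestClassifier(random_state=1),
}
print("Training Performance:")
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {recall_score(y_train, clf.predict(X_train))}")
print("Validation Performance:")
for name, clf in models.items():
    print(f"{name}: {recall_score(y_val, clf.predict(X_val))}")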
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
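A quick way to verify the effect of oversampling (a small check, not part of the original extract):

# class balance before and after SMOTE oversampling
print("Before oversampling:\n", y_train.value_counts())
print("After oversampling:\n", y_train_over.value_counts())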
Training Performance:
Bagging: 0.9979414791942361
Random forest: 1.0
Validation Performance:
Bagging: 0.8918918918918919
Random forest: 0.8378378378378378
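The second set of figures below corresponds to models trained on undersampled data (X_train_un, y_train_un, used again in the tuning section). The undersampling cell is not shown in this extract; it presumably looks something like the following, assuming imblearn's RandomUnderSampler:

# random undersampling of the majority class to balance the training data (assumed step)
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)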
Training Performance:
Bagging: 0.9930769230769231
Random forest: 1.0
Validation Performance:
Bagging: 0.9054054054054054
Random forest: 0.918918918918919
Hyperparameter Tuning
Note
1. Sample parameter grids have been provided for the necessary hyperparameter tuning.
These sample grids are expected to provide a balance between model performance
improvement and execution time. One can extend or reduce the parameter grid based on
execution time and system configuration.
• Please note that if the parameter grid is extended to improve model performance
further, the execution time will increase.
2. The models chosen in this notebook are based on test runs. One can update the best
models as obtained upon code execution and tune them for best performance.
# defining model
Model = AdaBoostClassifier(random_state=1)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
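The search above assumes that param_grid and scorer have already been defined (they belong to the sample grids mentioned in the note, which are not shown in this extract). A hypothetical example of what they might look like for AdaBoost, with recall as the tuning metric, is:

# hypothetical parameter grid and scorer; the notebook's actual sample grid may differ
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 1.0],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
scorer = make_scorer(recall_score)  # tune for recall, per the evaluation criterion

# the search is then fit on the training data, e.g.:
# randomized_cv.fit(X_train, y_train)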
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.1, n_estimators=100,
random_state=1)
adb_train = model_performance_classification_sklearn(tuned_adb, X_train, y_train)  ## Complete the code to check the performance on training set
adb_train
    base_estimator=DecisionTreeClassifier(
        max_depth=randomized_cv.best_params_['base_estimator'].max_depth, random_state=1
    )
)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.1, n_estimators=100,
random_state=1)
adb2_train = model_performance_classification_sklearn(tuned_ada2, X_train_un, y_train_un)  ## Complete the code to check the performance on training set
adb2_train
# defining model
Model = GradientBoostingClassifier(random_state=1)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
tuned_gbm1.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, random_state=1,
subsample=0.9)
gbm1_train = model_performance_classification_sklearn(tuned_gbm1, X_train_un, y_train_un)  ## Complete the code to check the performance on undersampled train set
gbm1_train
# defining model
Model = GradientBoostingClassifier(random_state=1)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, random_state=1,
subsample=0.9)
%%time

# defining model
Model = XGBClassifier(random_state=1, eval_metric='logloss')

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
xgb_train = model_performance_classification_sklearn(tuned_xgb, X_train, y_train)  ## Complete the code to check the performance on original train set
xgb_train
models_train_comp_df = pd.concat(
    [
        gbm1_train.T,
        gbm2_train.T,
        adb2_train.T,
        xgb_train.T,  # Adding XGBoost training performance, matching the validation comparison
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Original data",
    "AdaBoost trained with Undersampled data",
    "XGBoost trained with Original data",
]
print("Training performance comparison:")
models_train_comp_df
models_val_comp_df = pd.concat(
[
gbm1_val.T,
gbm2_val.T,
adb2_val.T,
xgb_val.T, # Adding XGBoost validation performance
],
axis=1,
)
models_val_comp_df.columns = [
"Gradient boosting trained with Undersampled data",
"Gradient boosting trained with Original data",
"AdaBoost trained with Undersampled data",
"XGBoost trained with Original data", # Adding XGBoost column
]
print("Validation performance comparison:")
models_val_comp_df
Now that we have our final model, let's find out how it performs on the unseen test data.
Test performance:
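The feature-importance plot below relies on three variables that are not defined in this extract. A minimal sketch of how they could be set up (final_model is a hypothetical name for whichever tuned model is chosen as the final one):

# assumed setup for the feature-importance plot; final_model is a placeholder name
importances = final_model.feature_importances_   # importance of each feature
indices = np.argsort(importances)                # feature indices sorted by importance
feature_names = list(X_train.columns)            # names matching the model inputs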
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Business Insights and Conclusions