AML Project - Learner Notebook (Low Code)
Business Context
Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit
cards are a good source of income for banks because of the various fees they charge, such as
annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign
transaction fees. Some fees are charged to every user irrespective of usage, while others are
charged only under specific circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants
to analyze its customer data, identify the customers who are likely to leave their credit card
services, and understand the reasons why - so that the bank can improve in those areas.
As a Data Scientist at Thera Bank, you need to build a classification model that will help
the bank improve its services so that customers do not give up their credit cards.
Data Description
• CLIENTNUM: Client number. Unique identifier for the customer holding the account
• Attrition_Flag: Internal event (customer activity) variable - if the account is closed then
"Attrited Customer" else "Existing Customer"
• Customer_Age: Age in Years
• Gender: Gender of the account holder
• Dependent_count: Number of dependents
• Education_Level: Educational qualification of the account holder - Graduate, High
School, Unknown, Uneducated, College (refers to a college student), Post-Graduate,
Doctorate
• Marital_Status: Marital Status of the account holder
• Income_Category: Annual Income Category of the account holder
• Card_Category: Type of Card
• Months_on_book: Period of relationship with the bank (in months)
• Total_Relationship_Count: Total no. of products held by the customer
• Months_Inactive_12_mon: No. of months inactive in the last 12 months
• Contacts_Count_12_mon: No. of Contacts in the last 12 months
• Credit_Limit: Credit Limit on the Credit Card
• Total_Revolving_Bal: Total Revolving Balance on the Credit Card
• Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
• Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
• Total_Trans_Amt: Total Transaction Amount (Last 12 months)
• Total_Trans_Ct: Total Transaction Count (Last 12 months)
• Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
• Avg_Utilization_Ratio: Average Card Utilization Ratio
What Is a Revolving Balance?
• If we don't pay the balance of a revolving credit account in full every month, the unpaid
portion carries over to the next month; that carried-over amount is called a revolving balance.
• For example, if a customer charges $1,000 in a month and pays only $400 of it, the remaining
$600 (plus any interest) carries over as the revolving balance.
• Blanks '_______' are provided in the notebook and need to be filled with appropriate
code to get the correct result. Every '_______' blank is accompanied by a comment that
briefly describes what needs to be filled in.
• Identify the task to be performed, and only then write the required code.
• Fill in the code wherever indicated by comments like "# write your code here" or "#
complete the code". Running incomplete code may throw an error.
• Please run the code cells sequentially from the beginning to avoid unnecessary errors.
• Add the results/observations (wherever mentioned) derived from the analysis to the
presentation and submit it.
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
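Only two import statements survive in this extract. As a rough sketch (the notebook's actual import cell is not shown here), the libraries used later in this notebook would be brought in along these lines:

# Assumed consolidated imports for the rest of the notebook (not part of the extract)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, f1_score,
    confusion_matrix, make_scorer,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier, RandomForestClassifier,
    AdaBoostClassifier, GradientBoostingClassifier,
)
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler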
Data Overview
The initial steps to get an overview of any dataset are to:
• observe the first few rows of the dataset, to check whether the dataset has been loaded
properly or not
• get information about the number of rows and columns in the dataset
• find out the data types of the columns to ensure that data is stored in the preferred
format and the value of each property is as expected.
• check the statistical summary of the dataset to get an overview of the numerical columns
of the data
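The overview outputs that follow were produced by the usual pandas calls; a minimal sketch (assuming the data has been loaded into a DataFrame named data) is:

data.shape          # number of rows and columns
data.head()         # first few rows
data.tail()         # last few rows
data.info()         # column data types and non-null counts
data.describe().T   # statistical summary of the numerical columns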
(10127, 21)
data.head() (last displayed column only; the wide dataframe output is truncated in this extract):
   Avg_Utilization_Ratio
0                  0.061
1                  0.105
2                  0.000
3                  0.760
4                  0.000

data.tail() (selected columns only):
       Months_on_book  Total_Relationship_Count  Months_Inactive_12_mon  Total_Ct_Chng_Q4_Q1  Avg_Utilization_Ratio
10122              40                         3                       2                0.857                  0.462
10123              25                         4                       2                0.683                  0.511
10124              36                         5                       3                0.818                  0.000
10125              36                         4                       3                0.722                  0.000
10126              25                         6                       2                0.649                  0.189
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CLIENTNUM 10127 non-null int64
1 Attrition_Flag 10127 non-null object
2 Customer_Age 10127 non-null int64
3 Gender 10127 non-null object
4 Dependent_count 10127 non-null int64
5 Education_Level 8608 non-null object
6 Marital_Status 9378 non-null object
7 Income_Category 10127 non-null object
8 Card_Category 10127 non-null object
9 Months_on_book 10127 non-null int64
10 Total_Relationship_Count 10127 non-null int64
11 Months_Inactive_12_mon 10127 non-null int64
12 Contacts_Count_12_mon 10127 non-null int64
13 Credit_Limit 10127 non-null float64
14 Total_Revolving_Bal 10127 non-null int64
15 Avg_Open_To_Buy 10127 non-null float64
16 Total_Amt_Chng_Q4_Q1 10127 non-null float64
17 Total_Trans_Amt 10127 non-null int64
18 Total_Trans_Ct 10127 non-null int64
19 Total_Ct_Chng_Q4_Q1 10127 non-null float64
20 Avg_Utilization_Ratio 10127 non-null float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
• There are a total of 21 columns and 10,000+ observations in the dataset.
• Two columns, Education_Level and Marital_Status, have missing values - around 1,500 and
750 respectively; all other columns are complete.
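The two summaries below - missing values per column and unique values per column - would come from calls such as:

data.isnull().sum()   # number of missing values in each column
data.nunique()        # number of unique values in each column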
CLIENTNUM 0
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
CLIENTNUM 10127
Attrition_Flag 2
Customer_Age 45
Gender 2
Dependent_count 6
Education_Level 6
Marital_Status 3
Income_Category 6
Card_Category 4
Months_on_book 44
Total_Relationship_Count 6
Months_Inactive_12_mon 7
Contacts_Count_12_mon 7
Credit_Limit 6205
Total_Revolving_Bal 1974
Avg_Open_To_Buy 6813
Total_Amt_Chng_Q4_Q1 1158
Total_Trans_Amt 5033
Total_Trans_Ct 126
Total_Ct_Chng_Q4_Q1 830
Avg_Utilization_Ratio 964
dtype: int64
• Customer_Age takes only 45 distinct values, i.e., customer ages span a fairly narrow range
• We have many continuous variables - Customer_Age, Credit_Limit, and
Total_Revolving_Bal, for example
• Most of the remaining variables are categorical or discrete counts
data.describe(include=["object"]).T
for i in data.describe(include=["object"]).columns:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
# CLIENTNUM consists of unique IDs for clients and hence will not add value to the modeling
data.drop(["CLIENTNUM"], axis=1, inplace=True)
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with the count or percentage labeled at the top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-position of the bar centre
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()


def distribution_plot_wrt_target(data, predictor, target):
    """Plot the distribution of a predictor for each level of the target."""
    target_uniq = data[target].unique()
    # (per-level distribution plotting code not shown in this extract)
    plt.tight_layout()
    plt.show()
Univariate analysis
Customer_Age
Months_on_book
histogram_boxplot(data, "Months_on_book")  ## Complete the code to create histogram_boxplot for 'Months_on_book'
Credit_Limit
Total_Trans_Ct
Total_Ct_Chng_Q4_Q1
labeled_barplot(data, "Dependent_count")
Total_Relationship_Count
Correlation Check
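The correlation heatmap itself is a plot and does not survive in this text extract; a sketch of code that would produce it (computed on the numeric columns only) is:

# heatmap of pairwise correlations between the numeric columns (sketch, not from the extract)
plt.figure(figsize=(15, 7))
sns.heatmap(
    data.select_dtypes(include=np.number).corr(),
    annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral",
)
plt.show()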
Attrition_Flag     0     1    All
Gender
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Contacts_Count_12_mon
stacked_barplot(data, "Attrition_Flag", "Contacts_Count_12_mon")  ## Complete the code to create stacked_barplot for Attrition_Flag vs Contacts_Count_12_mon
Contacts_Count_12_mon    0     1     2     3     4    5   6    All
Attrition_Flag
1                        7   108   403   681   315   59  54   1627
All                    399  1499  3227  3380  1392  176  54  10127
0                      392  1391  2824  2699  1077  117   0   8500
------------------------------------------------------------------------------------------------------------------------
Let's see how the number of months a customer was inactive in the last 12 months
(Months_Inactive_12_mon) varies with the customer's account status (Attrition_Flag)
Attrition_Flag vs Months_Inactive_12_mon
stacked_barplot(data,"Attrition_Flag", "Months_Inactive_12_mon") ##
Complete the code to create distribution_plot for Attrition_Flag vs
Months_Inactive_12_mon
Months_Inactive_12_mon    0     1     2     3    4    5    6    All
Attrition_Flag
All                      29  2233  3282  3846  435  178  124  10127
1                        15   100   505   826  130   32   19   1627
0                        14  2133  2777  3020  305  146  105   8500
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Total_Relationship_Count
stacked_barplot(data,"Attrition_Flag", "Total_Relationship_Count") ##
Complete the code to create distribution_plot for Attrition_Flag vs
Total_Relationship_Count
Total_Relationship_Count    1     2     3     4     5     6    All
Attrition_Flag
All                       910  1243  2305  1912  1891  1866  10127
0                         677   897  1905  1687  1664  1670   8500
1                         233   346   400   225   227   196   1627
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Dependent_count
Dependent_count    0     1     2     3     4    5    All
Attrition_Flag
All              904  1838  2655  2732  1574  424  10127
0                769  1569  2238  2250  1314  360   8500
1                135   269   417   482   260   64   1627
------------------------------------------------------------------------------------------------------------------------
Total_Revolving_Bal vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Revolving_Bal",
"Attrition_Flag")
Total_Trans_Amt vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")  ## Complete the code to create distribution_plot for Total_Trans_Amt vs Attrition_Flag
Let's see how the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1)
varies with the customer's account status (Attrition_Flag)
Total_Ct_Chng_Q4_Q1 vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1",
"Attrition_Flag") ## Complete the code to create distribution_plot for
Total_Ct_Chng_Q4_Q1 vs Attrition_Flag
Avg_Utilization_Ratio vs Attrition_Flag
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio",
"Attrition_Flag") ## Complete the code to create distribution_plot for
Avg_Utilization_Ratio vs Attrition_Flag
Attrition_Flag vs Months_on_book
distribution_plot_wrt_target(data, "Attrition_Flag",
"Total_Revolving_Bal") ## Complete the code to create
distribution_plot for Attrition_Flag vs Total_Revolving_Bal
Attrition_Flag vs Avg_Open_To_Buy
distribution_plot_wrt_target(data, "Attrition_Flag",
"Avg_Open_To_Buy") ## Complete the code to create distribution_plot
for Attrition_Flag vs Avg_Open_To_Buy
Data Preprocessing
Outlier Detection
Q1 = numeric_data.quantile(0.25)  # To find the 25th percentile
Q3 = numeric_data.quantile(0.75)  # To find the 75th percentile
IQR = Q3 - Q1  # Inter-quartile range
# Finding lower and upper bounds for all values; all values outside these bounds are outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
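With these bounds in hand, the share of outliers in each numeric column can be checked; a minimal sketch (numeric_data is the numeric subset of the dataframe, as above):

# percentage of observations per column falling outside the IQR-based bounds
outlier_pct = (
    ((numeric_data < lower) | (numeric_data > upper)).sum() / len(numeric_data) * 100
)
print(outlier_pct.round(2))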
Train-Test Split
# creating the copy of the dataframe
data1 = data.copy()
data1.isna().sum()
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
dtype: int64
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")
X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"]
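The split, imputation, and encoding cells themselves are not shown in this extract, only their outputs. A rough sketch of those steps, under assumed split proportions (the actual sizes used in the notebook may differ) and assuming Attrition_Flag has already been encoded as 0/1 (1 = attrited), as the earlier crosstabs suggest:

# split into a temporary train set and a test set, then carve a validation set out of train
# (the 60/20/20 proportions here are an assumption)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

# impute missing Education_Level / Marital_Status with the most frequent value,
# fitting the imputer on the training set only
cols_to_impute = ["Education_Level", "Marital_Status"]
X_train[cols_to_impute] = imputer.fit_transform(X_train[cols_to_impute])
X_val[cols_to_impute] = imputer.transform(X_val[cols_to_impute])
X_test[cols_to_impute] = imputer.transform(X_test[cols_to_impute])

# one-hot encode the categorical columns (drop_first avoids redundant dummy columns)
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)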
Missing-value check after imputation: every column shows 0 missing values in each of the three sets (train, validation, and test).
Gender
F 4279
M 3822
Name: count, dtype: int64
******************************
Education_Level
Graduate 3733
High School 1619
Uneducated 1171
College 816
Post-Graduate 407
Doctorate 355
Name: count, dtype: int64
******************************
Marital_Status
Married 4346
Single 3144
Divorced 611
Name: count, dtype: int64
******************************
Income_Category
Less than $40K 2812
$40K - $60K 1453
$80K - $120K 1237
$60K - $80K 1122
abc 889
$120K + 588
Name: count, dtype: int64
******************************
Card_Category
Blue 7557
Silver 436
Gold 93
Platinum 15
Name: count, dtype: int64
******************************
Gender
F 266
M 241
Name: count, dtype: int64
******************************
Education_Level
Graduate 237
High School 94
Uneducated 84
College 49
Doctorate 24
Post-Graduate 19
Name: count, dtype: int64
******************************
Marital_Status
Married 272
Single 193
Divorced 42
Name: count, dtype: int64
******************************
Income_Category
Less than $40K 174
$40K - $60K 88
$60K - $80K 74
$80K - $120K 71
abc 62
$120K + 38
Name: count, dtype: int64
******************************
Card_Category
Blue 465
Silver 37
Gold 3
Platinum 2
Name: count, dtype: int64
******************************
[Truncated preview of the one-hot encoded training data: dummy columns such as Marital_Status_Married, Marital_Status_Single, ..., Card_Category_Silver hold boolean (True/False) values for a sample of rows.]
Model Building
Model evaluation criterion
The model can make wrong predictions in two ways:
• Predicting that a customer will not attrite when they actually do, i.e., losing a valuable
customer (a false negative).
• Predicting that a customer will attrite when they actually do not, i.e., spending retention
effort on a customer who was going to stay (a false positive).
Losing a customer who attrites is the costlier error, so the bank would want Recall to be
maximized: the greater the Recall, the fewer the false negatives. Hence, the focus should be on
increasing Recall, i.e., correctly identifying the true positives (Class 1), so that the bank can
retain its valuable customers by flagging those at risk of attrition.
Let's define a function to output different metrics (including recall) on the train and test
sets, and a function to show the confusion matrix, so that we do not have to repeat the same
code every time we evaluate a model.
def model_performance_classification_sklearn(model, predictors, target):
    """
    Compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    pred = model.predict(predictors)  # predict using the independent variables
    # metrics reported: accuracy, recall, precision, and F1 (target assumed encoded as 0/1)
    df_perf = pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )
    return df_perf


def confusion_matrix_sklearn(model, predictors, target):
    """
    Plot the confusion matrix with counts and percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Training Performance:
Bagging: 0.9784615384615385
Random forest: 1.0
Validation Performance:
Bagging: 0.8513513513513513
Random forest: 0.7432432432432432
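The recall figures above come from baseline models whose code is not shown in this extract; a minimal sketch of how such figures could be produced (variable names such as X_val and y_val are assumptions) is:

# fit two baseline models and report recall on the training and validation sets
models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random forest": RandomForestClassifier(random_state=1),
}
print("Training Performance:")
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {recall_score(y_train, clf.predict(X_train))}")
print("Validation Performance:")
for name, clf in models.items():
    print(f"{name}: {recall_score(y_val, clf.predict(X_val))}")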
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
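A quick way to verify the effect of oversampling (a small check, not part of the original extract):

# class balance before and after SMOTE oversampling
print("Before oversampling:\n", y_train.value_counts())
print("After oversampling:\n", y_train_over.value_counts())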
Training Performance:
Bagging: 0.9979414791942361
Random forest: 1.0
Validation Performance:
Bagging: 0.8918918918918919
Random forest: 0.8378378378378378
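The second set of figures below corresponds to models trained on undersampled data (X_train_un, y_train_un, used again in the tuning section). The undersampling cell is not shown in this extract; it presumably looks something like the following, assuming imblearn's RandomUnderSampler:

# random undersampling of the majority class to balance the training data (assumed step)
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)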
Training Performance:
Bagging: 0.9930769230769231
Random forest: 1.0
Validation Performance:
Bagging: 0.9054054054054054
Random forest: 0.918918918918919
Hyperparameter Tuning
Note
1. Sample parameter grids have been provided for the necessary hyperparameter tuning.
These sample grids are expected to provide a balance between model performance
improvement and execution time. One can extend or reduce the parameter grid based on
execution time and system configuration.
• Please note that if the parameter grid is extended to improve model performance
further, the execution time will increase.
2. The models chosen in this notebook are based on test runs. One can update the best
models as obtained upon code execution and tune them for best performance.
# defining model
Model = AdaBoostClassifier(random_state=1)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
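The search above assumes that param_grid and scorer have already been defined (they belong to the sample grids mentioned in the note, which are not shown in this extract). A hypothetical example of what they might look like for AdaBoost, with recall as the tuning metric, is:

# hypothetical parameter grid and scorer; the notebook's actual sample grid may differ
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 1.0],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
scorer = make_scorer(recall_score)  # tune for recall, per the evaluation criterion

# the search is then fit on the training data, e.g.:
# randomized_cv.fit(X_train, y_train)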
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.1, n_estimators=100,
random_state=1)
adb_train = model_performance_classification_sklearn(tuned_adb, X_train, y_train)  ## Complete the code to check the performance on training set
adb_train
    base_estimator=DecisionTreeClassifier(
        max_depth=randomized_cv.best_params_['base_estimator'].max_depth, random_state=1
    )
)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.1, n_estimators=100,
random_state=1)
adb2_train = model_performance_classification_sklearn(tuned_ada2, X_train_un, y_train_un)  ## Complete the code to check the performance on training set
adb2_train
# defining model
Model = GradientBoostingClassifier(random_state=1)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
tuned_gbm1.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, random_state=1,
subsample=0.9)
gbm1_train = model_performance_classification_sklearn(tuned_gbm1, X_train_un, y_train_un)  ## Complete the code to check the performance on undersampled train set
gbm1_train
# defining model
Model = GradientBoostingClassifier(random_state=1)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, random_state=1,
subsample=0.9)
%%time

# defining model
Model = XGBClassifier(random_state=1, eval_metric='logloss')

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
xgb_train = model_performance_classification_sklearn(tuned_xgb, X_train, y_train)  ## Complete the code to check the performance on original train set
xgb_train
models_train_comp_df = pd.concat(
    [
        gbm1_train.T,
        gbm2_train.T,
        adb2_train.T,
        xgb_train.T,  # Adding XGBoost training performance, matching the validation comparison
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Original data",
    "AdaBoost trained with Undersampled data",
    "XGBoost trained with Original data",
]
print("Training performance comparison:")
models_train_comp_df
models_val_comp_df = pd.concat(
[
gbm1_val.T,
gbm2_val.T,
adb2_val.T,
xgb_val.T, # Adding XGBoost validation performance
],
axis=1,
)
models_val_comp_df.columns = [
"Gradient boosting trained with Undersampled data",
"Gradient boosting trained with Original data",
"AdaBoost trained with Undersampled data",
"XGBoost trained with Original data", # Adding XGBoost column
]
print("Validation performance comparison:")
models_val_comp_df
Now that we have our final model, let's find out how it performs on the unseen test data.
Test performance:
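The feature-importance plot below relies on three variables that are not defined in this extract. A minimal sketch of how they could be set up (final_model is a hypothetical name for whichever tuned model is chosen as the final one):

# assumed setup for the feature-importance plot; final_model is a placeholder name
importances = final_model.feature_importances_   # importance of each feature
indices = np.argsort(importances)                # feature indices sorted by importance
feature_names = list(X_train.columns)            # names matching the model inputs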
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Business Insights and Conclusions