
German Credit Analysis

Context
When a bank receives a loan application, it has to decide, based on the applicant's profile, whether or not to
approve the loan. Two types of risk are associated with the bank's decision -
If the applicant is a good credit risk, i.e. likely to repay the loan, then not
approving the loan results in a loss of business to the bank
If the applicant is a bad credit risk, i.e. not likely to repay the loan, then approving
the loan results in a financial loss to the bank
To minimize these losses, HRE bank wants to automate this process using a predictive model
that will predict whether a customer is at risk of default, based on the
customer's demographic and socio-economic profile
You, as a Data Scientist at HRE bank, have been assigned the task of building a predictive
model that will predict whether a customer is at risk of default or not
Objective
The objective is to build a model to predict whether a person would default or not. In this
dataset, the target variable is 'Risk'.
Dataset Description
Age (Numeric: Age in years)
Sex (Categories: male, female)
Job (Categories : 0 - unskilled and non-resident, 1 - unskilled and resident, 2 -
skilled, 3 - highly skilled)
Housing (Categories: own, rent, or free)
Saving accounts (Categories: little, moderate, quite rich, rich)
Checking account (Categories: little, moderate, rich)
Credit amount (Numeric: Amount of credit in DM - Deutsche Mark)
Duration (Numeric: Duration for which the credit is given in months)
Purpose (Categories: car, furniture/equipment, radio/TV, domestic appliances,
repairs, education, business, vacation/others)
Risk (0 - Person is not at risk, 1 - Person is at risk (defaulter))

Importing libraries
In [2]: # To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization


%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation


from sklearn.impute import SimpleImputer

# To help with model building


from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier

# To get different metric scores, and split data


from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
)

# To be used for data scaling and one hot encoding


from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To be used for tuning the model


from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them


from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe


pd.set_option("display.max_columns", None)

# To suppress scientific notation for dataframes


pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically
%load_ext nb_black

The nb_black extension is already loaded. To reload it, use:


%reload_ext nb_black

Loading Data
In [3]: # Loading the dataset
german = pd.read_csv("German_Credit.csv")
In [4]: # Checking the number of rows and columns in the data
german.shape

Out[4]: (1000, 10)

Data Overview
In [5]: data = german.copy()

In [6]: # let's view the first 5 rows of the data


data.head()

Out[6]:
   Age     Sex  Job Housing Saving accounts Checking account  Credit amount  Duration              Purpose  Risk
0   67    male    2     own             NaN           little           1169         6             radio/TV     …
1   22  female    2     own          little         moderate           5951        48             radio/TV     …
2   49    male    1     own          little              NaN           2096        12            education     …
3   45    male    2    free          little           little           7882        42  furniture/equipment     …
4   53    male    2    free          little           little           4870        24                  car     …

In [7]: # let's view the last 5 rows of the data


data.tail()

Out[7]:
     Age     Sex  Job Housing Saving accounts Checking account  Credit amount  Duration              Purpose  Risk
995   31  female    1     own          little              NaN           1736        12  furniture/equipment     …
996   40    male    3     own          little           little           3857        30                  car     …
997   38    male    2     own          little              NaN            804        12             radio/TV     …
998   23    male    2    free          little           little           1845        45             radio/TV     …
999   27    male    2     own        moderate         moderate           4576        45                  car     …

In [8]: # let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1000 non-null int64
1 Sex 1000 non-null object
2 Job 1000 non-null int64
3 Housing 1000 non-null object
4 Saving accounts 817 non-null object
5 Checking account 606 non-null object
6 Credit amount 1000 non-null int64
7 Duration 1000 non-null int64
8 Purpose 1000 non-null object
9 Risk 1000 non-null int64
dtypes: int64(5), object(5)
memory usage: 78.2+ KB

There are a total of 10 columns and 1,000 observations in the dataset


We can see that 2 columns have less than 1,000 non-null values, i.e. these columns have
missing values.
In [9]: # let's check for duplicate values in the data
data.duplicated().sum()

Out[9]: 0

In [10]: # let's check for missing values in the data


round(data.isnull().sum() / data.isnull().count() * 100, 2)

Out[10]:
Age                  0.000
Sex                  0.000
Job                  0.000
Housing              0.000
Saving accounts     18.300
Checking account    39.400
Credit amount        0.000
Duration             0.000
Purpose              0.000
Risk                 0.000
dtype: float64

The Saving accounts column has 18.3% missing values out of the total observations.
The Checking account column has 39.4% missing values out of the total observations.
We will impute these values after splitting the data into train, validation, and test sets, so that
information from the validation and test sets does not leak into the imputation.
In [11]: # Checking for the null value in the dataset
data.isna().sum()
Out[11]:
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64

Let's check the number of unique values in each column


In [12]: data.nunique()

Out[12]:
Age                  53
Sex                   2
Job                   4
Housing               3
Saving accounts       4
Checking account      3
Credit amount       921
Duration             33
Purpose               8
Risk                  2
dtype: int64

Age has only 53 unique values, i.e. many customers share the same ages
We have only three continuous variables - Age, Credit amount, and Duration
All other variables are categorical
In [13]: # let's view the statistical summary of the numerical columns in the data
data.describe().T

Out[13]:
                  count      mean       std      min       25%       50%       75%        max
Age            1000.000    35.546    11.375   19.000    27.000    33.000    42.000     75.000
Job            1000.000     1.904     0.654    0.000     2.000     2.000     2.000      3.000
Credit amount  1000.000  3271.258  2822.737  250.000  1365.500  2319.500  3972.250  18424.000
Duration       1000.000    20.903    12.059    4.000    12.000    18.000    24.000     72.000
Risk           1000.000     0.300     0.458    0.000     0.000     0.000     1.000      1.000

The mean age is approx 35 and the median is 33, i.e. more than half of the
customers are under 35 years of age.
The mean credit amount is approx 3,271 DM, but it has a wide range of 250 to 18,424. We
will explore this further in univariate analysis.
The mean duration for which the credit is given is approx 21 months.
Checking the value count for each category of categorical variables
In [14]: # Making a list of all categorical variables
cat_col = [
"Sex",
"Job",
"Housing",
"Saving accounts",
"Checking account",
"Purpose",
"Risk",
]

# Printing number of count of each unique value in each column


for column in cat_col:
print(data[column].value_counts())
print("-" * 40)

male 690
female 310
Name: Sex, dtype: int64
----------------------------------------
2 630
1 200
3 148
0 22
Name: Job, dtype: int64
----------------------------------------
own 713
rent 179
free 108
Name: Housing, dtype: int64
----------------------------------------
little 603
moderate 103
quite rich 63
rich 48
Name: Saving accounts, dtype: int64
----------------------------------------
little 274
moderate 269
rich 63
Name: Checking account, dtype: int64
----------------------------------------
car 337
radio/TV 280
furniture/equipment 181
business 97
education 59
repairs 22
vacation/others 12
domestic appliances 12
Name: Purpose, dtype: int64
----------------------------------------
0 700
1 300
Name: Risk, dtype: int64
----------------------------------------

We have more male customers than female customers


There are very few observations (only 22) for customers with the job category
unskilled and non-resident
We can see that the distribution of classes in the target variable is imbalanced, i.e.
only 30% of observations are defaulters.

Univariate analysis
In [15]: # function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):


"""
Boxplot and histogram combined

data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
)  # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
)  # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram

Observation on Age
In [16]: # Observations on Customer_age
histogram_boxplot(data, "Age")
The distribution of age is right-skewed
The boxplot shows that there are outliers at the right end
We will not treat these outliers as they represent the real market trend
Observation on Credit Amount
In [17]: histogram_boxplot(data, "Credit amount")

The distribution of the credit amount is right-skewed


The boxplot shows that there are outliers at the right end
We will not treat these outliers as they represent the real market trend
Observations on Duration
In [18]: histogram_boxplot(data, "Duration")

The distribution of the duration for which the credit is given is right-skewed
The boxplot shows that there are outliers at the right end
We will not treat these outliers as they represent the real market trend
In [19]: # function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):


"""
Barplot with percentage at the top

data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""

total = len(data[feature]) # length of the column


count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))

plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)

for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category

x = p.get_x() + p.get_width() / 2  # x-coordinate for the annotation

y = p.get_height()  # height of the bar, used as the y-coordinate for the annotation

ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage

plt.show() # show the plot

Observations on Risk
In [20]: # observations on Risk
labeled_barplot(data, "Risk")

As mentioned earlier, the class distribution in the target variable is imbalanced.


We have 70% observations for non-defaulters and 30% observations for defaulters.
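The 70/30 split can also be confirmed numerically; a minimal check, not part of the original notebook, using the data frame already loaded above:

# Quick numeric check of the class balance in the target variable
data["Risk"].value_counts(normalize=True)
# 0    0.700
# 1    0.300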
Observations on Sex of Customers
In [21]: # observations on Sex
labeled_barplot(data, "Sex")
Male customers take credit more often than female customers
Approx 69% of the customers are male and 31% are female
Observations on Housing
In [22]: # observations on Housing
labeled_barplot(data, "Housing")

The majority of customers who take credit, approx 71%, own their house
Approx 18% of customers live in a rented house
Only 11% of customers have free housing. These are the customers
who live in a house provided by their company or organization
Observations on Job
In [23]: # observations on Job
labeled_barplot(data, "Job")

The majority of customers, i.e. 63%, fall into the skilled category.
Only approx 15% of customers lie in the highly skilled category, which
makes sense as these may be persons with higher education or more experience.
Approx 22% of observations fall in the unskilled categories (0 or 1).
Observations on Saving accounts
In [24]: # observations on Saving accounts
labeled_barplot(data, "Saving accounts")
Approx 70% of customers who take credit have a little or moderate amount in their
savings account. This makes sense as these customers would need credit more
than the other categories.
Approx 11% of customers who take credit are in a rich category based on their
balance in the savings account.
Note that the percentages do not add up to 100 as we have missing values in this
column.
Observations on Checking account
In [25]: # observations on Checking account
labeled_barplot(data, "Checking account")
Approx 54% of customers who take credit have a little or moderate amount in their
checking account. This makes sense as these customers would need credit more
than the other categories.
Approx 6% of customers who take credit are in the rich category based on their
balance in checking account.
Note that the percentages do not add up to 100 as we have missing values in this
column.
Observations on Purpose
In [26]: # observations on Purpose
labeled_barplot(data, "Purpose")
The plot shows that most customers take credit for luxury items like cars, radio/TV,
furniture/equipment, and domestic appliances.
Only approx 16% of customers take credit for business or education.

Bivariate Analysis
In [27]: sns.pairplot(data, hue="Risk")

Out[27]: <seaborn.axisgrid.PairGrid at 0x267080d8ac0>
There are overlaps, i.e. no clear distinction in the distribution of variables for people
who defaulted and those who did not.
Let's explore this further with the help of other plots.
In [28]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Age", data=data, orient="vertical")

Out[28]: <matplotlib.axes._subplots.AxesSubplot at 0x267075b2d00>
We can see that the median age of defaulters is less than the median age of non-
defaulters.
This shows that younger customers are more likely to default.
There are outliers in boxplots of both class distributions
In [29]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Credit amount", data=data, orient="vertical")

Out[29]: <matplotlib.axes._subplots.AxesSubplot at 0x267080cfa60>
We can see that the third quartile amount of defaulters is much more than the third
quartile amount of non-defaulters.
This shows that customers with high credit amounts are more likely to default.
There are outliers in boxplots of both class distributions
In [30]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Duration", data=data, orient="vertical")

Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0x2670769e400>
We can see that the second and third quartiles of duration for defaulters are much higher
than those for non-defaulters.
This shows that customers who take credit for longer durations are more likely to default.
In [31]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Saving accounts", y="Age", data=data)

Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x2670b4509d0>
The plot shows that older customers tend to be in the rich or quite rich categories.
The age of customers in the little and moderate categories is slightly lower, but there
are outliers in both distributions.
In [32]: # function to plot stacked bar chart

def stacked_barplot(data, predictor, target):


"""
Print the category counts and plot a stacked bar chart

data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(
loc="lower left",
frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()

In [33]: stacked_barplot(data, "Sex", "Risk")

Risk 0 1 All
Sex
All 700 300 1000
male 499 191 690
female 201 109 310
---------------------------------------------------------------------------
---------------------------------------------
We saw earlier that the percentage of male customers is higher than that of female
customers. This plot shows that female customers are more likely to default than
male customers.
In [34]: stacked_barplot(data, "Job", "Risk")

Risk 0 1 All
Job
All 700 300 1000
2 444 186 630
1 144 56 200
3 97 51 148
0 15 7 22
---------------------------------------------------------------------------
---------------------------------------------

There is no significant difference concerning the job level


However, highly skilled or unskilled/non-resident customers are more likely to
default than customers in job categories 1 or 2
In [35]: stacked_barplot(data, "Housing", "Risk")

Risk 0 1 All
Housing
All 700 300 1000
own 527 186 713
rent 109 70 179
free 64 44 108
---------------------------------------------------------------------------
---------------------------------------------

Customers owning a house are less likely to default


Customers with free or rented housing are almost at the same risk of default
In [36]: stacked_barplot(data, "Saving accounts", "Risk")

Risk 0 1 All
Saving accounts
All 549 268 817
little 386 217 603
moderate 69 34 103
quite rich 52 11 63
rich 42 6 48
---------------------------------------------------------------------------
---------------------------------------------
As we saw earlier, customers with a little or moderate amount in saving accounts
take more credit but at the same time, they are most likely to default.
Rich customers are slightly less likely to default as compared to quite rich
customers
In [37]: stacked_barplot(data, "Checking account", "Risk")

Risk 0 1 All
Checking account
All 352 254 606
little 139 135 274
moderate 164 105 269
rich 49 14 63
---------------------------------------------------------------------------
---------------------------------------------
The plot further confirms the findings of the plot above.
Customers with a little amount in their checking account are the most likely to default,
followed by customers with a moderate amount, who in turn are more likely to default
than rich customers.
In [38]: stacked_barplot(data, "Purpose", "Risk")

Risk 0 1 All
Purpose
All 700 300 1000
car 231 106 337
radio/TV 218 62 280
furniture/equipment 123 58 181
business 63 34 97
education 36 23 59
repairs 14 8 22
vacation/others 7 5 12
domestic appliances 8 4 12
---------------------------------------------------------------------------
---------------------------------------------

Customers who take credit for radio/TV are least likely to default. This might be
because their credit amount is small.
Customers who take credit for education or vacation are most likely to default.
Other categories have no significant difference between their default and non-
default ratio.
In [39]: plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Credit amount and duration have a positive correlation which makes sense as
customers might take the credit for a longer duration if the amount of credit is high
Other variables have no significant correlation between them
Data Preparation for Modeling
Split data
In [40]: df = data.copy()

In [41]: X = df.drop(["Risk"], axis=1)


y = df["Risk"]

In [42]: # Splitting data into training, validation and test sets:


# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(


X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(


X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)

(600, 9) (200, 9) (200, 9)

Missing-Value Treatment
We will use the mode to impute missing values in the Saving accounts and Checking
account columns.
In [43]: # Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Saving accounts", "Checking account"]

# fit and transform the imputer on train data


X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])

# Transform on validation and test data


X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])

# Transform on test data


X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])

In [45]: # Creating dummy variables for categorical variables


X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)

Model evaluation criterion


We will be using Recall as the metric for our model performance,
because the company could face 2 types of losses:
1. Giving a loan to a defaulter - loss of money
2. Not giving a loan to a non-defaulter - loss of opportunity
Which loss is greater?
Giving a loan to a defaulter, i.e. predicting that a person is not at risk while the person is
actually at risk of default.
How to reduce this loss, i.e. how to reduce false negatives?
The bank wants recall to be maximized, i.e. we need to reduce the number of false
negatives.
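As a quick illustration of how recall relates to false negatives (recall = TP / (TP + FN)), here is a minimal sketch, not part of the original notebook, using a small made-up label vector:

# Illustration: recall = TP / (TP + FN), so fewer false negatives means higher recall
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # 1 = defaulter (at risk), made-up labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # made-up model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                 # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))   # same value computed by sklearn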
In [46]: models = [] # Empty list to store all the models

# Appending models into the list


models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = [] # Empty list to store all model's CV scores


names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Validation Performance:" "\n")

for name, model in models:


model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
score.append(scores)
print("{}: {}".format(name, scores))

Cross-Validation Performance:

Bagging: 24.444444444444446
Random forest: 24.444444444444446
GBM: 25.0
Adaboost: 25.0
Xgboost: 35.0
dtree: 43.33333333333333

Validation Performance:

Bagging: 0.2833333333333333
Random forest: 0.31666666666666665
GBM: 0.31666666666666665
Adaboost: 0.26666666666666666
Xgboost: 0.36666666666666664
dtree: 0.31666666666666665

In [47]: # Plotting boxplots for CV scores of all models defined above


fig = plt.figure()

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()
We can see that the decision tree is giving the highest cross-validated recall
followed by xgboost
The boxplot shows that the performance of decision tree and xgboost is consistent
and their performance on the validation set is also good
We will tune the best two models i.e. decision tree and xgboost and see if the
performance improves

Hyperparameter Tuning
We will tune decision tree and xgboost models using GridSearchCV and
RandomizedSearchCV. We will also compare the performance and time taken by
these two methods - grid search and randomized search.
First let's create two functions to calculate different metrics and confusion matrix,
so that we don't have to use the same code repeatedly for each model.
In [48]: # defining a function to compute different metrics to check performance of a classification model
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance

model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)

acc = accuracy_score(target, pred) # to compute Accuracy


recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score

# creating a dataframe of metrics


df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)

return df_perf

In [49]: def confusion_matrix_sklearn(model, predictors, target):


"""
To plot the confusion_matrix with percentages

model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
"{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())
for item in cm.flatten()
]
).reshape(2, 2)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")

Decision Tree
GridSearchCV
In [50]: # Creating pipeline
model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV


param_grid = {
"criterion": ["gini", "entropy"],
"max_depth": [3, 4, 5, None],
"min_samples_split": [2, 4, 7, 10, 15],
}

# Type of scoring used to compare parameter combinations


scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5)

# Fitting parameters in GridSeachCV


grid_cv.fit(X_train, y_train)

print(
"Best Parameters: {} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)

Best Parameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_split': 2}
Score: 0.40555555555555556

In [51]: # Creating new pipeline with best parameters


dtree_tuned1 = DecisionTreeClassifier(
random_state=1, criterion="gini", max_depth=None, min_samples_split=2
)

# Fit the model on training data


dtree_tuned1.fit(X_train, y_train)

Out[51]: DecisionTreeClassifier(random_state=1)

In [52]: # Calculating different metrics on train set


dtree_grid_train = model_performance_classification_sklearn(
dtree_tuned1, X_train, y_train
)
print("Training performance:")
dtree_grid_train

Training performance:
Out[52]: Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000

In [53]: # Calculating different metrics on validation set


dtree_grid_val = model_performance_classification_sklearn(dtree_tuned1, X_val, y_val)
print("Validation performance:")
dtree_grid_val

Validation performance:
Out[53]: Accuracy Recall Precision F1
0 0.595 0.317 0.322 0.319

In [54]: # creating confusion matrix


confusion_matrix_sklearn(dtree_tuned1, X_val, y_val)
The validation recall is the same as that of the decision tree model with default parameters
The tuned decision tree model is overfitting the training data
The validation recall is still just ~31%, i.e. the model is not good at identifying
defaulters
RandomizedSearchCV
In [55]: # Creating pipeline
model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV


param_grid = {
"criterion": ["gini", "entropy"],
"max_depth": [3, 4, 5, None],
"min_samples_split": [2, 4, 7, 10, 15],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=param_grid,
n_iter=20,
scoring=scorer,
cv=5,
random_state=1,
)

# Fitting parameters in RandomizedSearchCV


randomized_cv.fit(X_train, y_train)

print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)

Best parameters are {'min_samples_split': 2, 'max_depth': None, 'criterion': 'entropy'} with CV score=0.36666666666666664:
In [56]: # Creating new pipeline with best parameters
dtree_tuned2 = DecisionTreeClassifier(
random_state=1, criterion="entropy", max_depth=None, min_samples_split=2
)

# Fit the model on training data


dtree_tuned2.fit(X_train, y_train)

Out[56]: DecisionTreeClassifier(criterion='entropy', random_state=1)

In [57]: # Calculating different metrics on train set


dtree_random_train = model_performance_classification_sklearn(
dtree_tuned2, X_train, y_train
)
print("Training performance:")
dtree_random_train

Training performance:
Out[57]: Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000

In [58]: # Calculating different metrics on validation set


dtree_random_val = model_performance_classification_sklearn(dtree_tuned2, X_val, y_val)
print("Validation performance:")
dtree_random_val

Validation performance:
Out[58]: Accuracy Recall Precision F1
0 0.575 0.450 0.342 0.388

In [59]: # creating confusion matrix


confusion_matrix_sklearn(dtree_tuned2, X_val, y_val)

We reduced the number of iterations to only 20, but two out of the three best parameters
are the same as what we got from the grid search.
The validation recall (0.45) is higher than both the cross-validated recall (0.37) and the
validation recall of the tree tuned with GridSearchCV (0.32), while the accuracy is slightly lower.
Like the GridSearchCV-tuned tree, this model is overfitting the training data.

XGBoost
GridSearchCV
In [60]: %%time

#defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')

#Parameter grid to pass in GridSearchCV


param_grid={'n_estimators':np.arange(50,150,50),
'scale_pos_weight':[2,5,10],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.8,0.9,1],
'max_depth':np.arange(1,5,1),
'reg_lambda':[5,10]}

# Type of scoring used to compare parameter combinations


scorer = metrics.make_scorer(metrics.recall_score)

#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, verbose=1)

#Fitting parameters in GridSeachCV


grid_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:".format(grid_cv.best_params_, grid_cv.best_score_))

Fitting 5 folds for each of 2304 candidates, totalling 11520 fits


Best parameters are {'gamma': 0, 'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 50, 'reg_lambda': 5, 'scale_pos_weight': 10, 'subsample': 0.8} with CV score=1.0:
Wall time: 4min 15s

In [61]: # building model with best parameters


xgb_tuned1 = XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.8,
learning_rate=0.01,
gamma=0,
eval_metric="logloss",
reg_lambda=5,
max_depth=1,
)

# Fit the model on training data


xgb_tuned1.fit(X_train, y_train)
Out[61]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.01, max_delta_step=0,
              max_depth=1, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=50, n_jobs=8,
              num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=5,
              scale_pos_weight=10, subsample=0.8, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [62]: # Calculating different metrics on train set


xgboost_grid_train = model_performance_classification_sklearn(
xgb_tuned1, X_train, y_train
)
print("Training performance:")
xgboost_grid_train

Training performance:
Out[62]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462

In [63]: # Calculating different metrics on validation set


xgboost_grid_val = model_performance_classification_sklearn(xgb_tuned1, X_val, y_val)
print("Validation performance:")
xgboost_grid_val

Validation performance:
Out[63]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462

In [64]: # creating confusion matrix


confusion_matrix_sklearn(xgb_tuned1, X_val, y_val)

The validation recall has increased by ~65 percentage points compared to the result from cross-
validation with default parameters.
The model is giving a generalized performance, i.e. train and validation results are identical.
The model identifies all of the defaulters, but the precision of 0.30 indicates that it does so
by flagging almost every customer as a defaulter.
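The jump in recall is largely driven by scale_pos_weight, which up-weights the minority (defaulter) class. A commonly used heuristic, shown below as a sketch (this is not how the grid values above were chosen), is the ratio of negative to positive samples in the training set:

# Heuristic starting point for scale_pos_weight: (# negative samples) / (# positive samples)
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
print(neg / pos)  # about 420 / 180 = 2.33 for the stratified 70/30 split of 600 training rows
# Larger values (like the 5 and 10 searched above) push the model further towards
# predicting the positive class, trading precision for recall.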
RandomizedSearchCV
In [65]: %%time

# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')

# Parameter grid to pass in RandomizedSearchCV


param_grid={'n_estimators':np.arange(50,150,50),
'scale_pos_weight':[2,5,10],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.8,0.9,1],
'max_depth':np.arange(1,5,1),
'reg_lambda':[5,10]}

# Type of scoring used to compare parameter combinations


scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
xgb_tuned2 = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV


xgb_tuned2.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:".format(xgb_tuned2.best_params_, xgb_tuned2.best_score_))

Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'reg_lambda': 5, 'n_estimators': 50, 'max_depth': 1, 'learning_rate': 0.01, 'gamma': 1} with CV score=1.0:
Wall time: 5.39 s

In [66]: # building model with best parameters


xgb_tuned2 = XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
gamma=1,
subsample=0.9,
learning_rate=0.01,
eval_metric="logloss",
max_depth=1,
reg_lambda=5,
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)

Out[66]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=1, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.01, max_delta_step=0,
              max_depth=1, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=50, n_jobs=8,
              num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=5,
              scale_pos_weight=10, subsample=0.9, tree_method='exact',
              validate_parameters=1, verbosity=None)
In [67]: # Calculating different metrics on train set
xgboost_random_train = model_performance_classification_sklearn(
xgb_tuned2, X_train, y_train
)
print("Training performance:")
xgboost_random_train

Training performance:
Out[67]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462

In [68]: # Calculating different metrics on validation set


xgboost_random_val = model_performance_classification_sklearn(xgb_tuned2, X_val, y_val)
print("Validation performance:")
xgboost_random_val

Validation performance:
Out[68]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462

In [69]: # creating confusion matrix


confusion_matrix_sklearn(xgb_tuned2, X_val, y_val)

We reduced the number of iterations to only 20 but the model performance is very
similar to the results for the xgboost model tuned with GridSearchCV
Comparing models from GridSearchCV and RandomizedSearchCV
In [70]: # training performance comparison

models_train_comp_df = pd.concat(
[
dtree_grid_train.T,
dtree_random_train.T,
xgboost_grid_train.T,
xgboost_random_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree Tuned with Grid search",
"Decision Tree Tuned with Random search",
"Xgboost Tuned with Grid search",
"Xgboost Tuned with Random Search",
]
print("Training performance comparison:")
models_train_comp_df

Training performance comparison:


Out[70]:
            Decision Tree Tuned   Decision Tree Tuned    Xgboost Tuned      Xgboost Tuned
            with Grid search      with Random search     with Grid search   with Random Search
Accuracy          1.000                 1.000                  0.300              0.300
Recall            1.000                 1.000                  1.000              1.000
Precision         1.000                 1.000                  0.300              0.300
F1                1.000                 1.000                  0.462              0.462

In [71]: # Validation performance comparison

models_val_comp_df = pd.concat(
[
dtree_grid_val.T,
dtree_random_val.T,
xgboost_grid_val.T,
xgboost_random_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Decision Tree Tuned with Grid search",
"Decision Tree Tuned with Random search",
"Xgboost Tuned with Grid search",
"Xgboost Tuned with Random Search",
]
print("Validation performance comparison:")
models_val_comp_df

Validation performance comparison:


Out[71]:
            Decision Tree Tuned   Decision Tree Tuned    Xgboost Tuned      Xgboost Tuned
            with Grid search      with Random search     with Grid search   with Random Search
Accuracy          0.595                 0.575                  0.300              0.300
Recall            0.317                 0.450                  1.000              1.000
Precision         0.322                 0.342                  0.300              0.300
F1                0.319                 0.388                  0.462              0.462

We can see that XGBoost is giving a similar performance with GridSearchCV and
RandomizedSearchCV with a validation recall of ~1.00
Let's see the feature importance from the xgboost model tuned with GridSearchCV.
In [72]: feature_names = X_train.columns
importances = xgb_tuned1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Saving accounts and Duration are the two most important variables, which makes
sense as these variables play an important role in taking and repaying credit.

Pipelines for productionizing the model


Now that we have a final model, let's use pipelines to put the model into production

Column Transformer
We know that we can use pipelines to standardize model building, but the steps
in a pipeline are applied to each and every variable - how can we personalize the
pipeline to perform different processing on different columns?
Column transformer allows different columns or column subsets of the input to be
transformed separately and the features generated by each transformer will be
concatenated to form a single feature space. This is useful for heterogeneous or
columnar data, to combine several feature extraction mechanisms or
transformations into a single transformer.
We will create 2 different pipelines, one for numerical columns and one for
categorical columns
For numerical columns, we will do missing value imputation as pre-processing
For categorical columns, we will do one hot encoding and missing value imputation
as pre-processing
We apply missing value imputation to all columns, so that any missing values
appearing in future data can also be handled.
In [73]: # creating a list of numerical variables
numerical_features = ["Age", "Credit amount", "Duration"]

# creating a transformer for numerical variables, which will apply the simple imputer


numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])

# creating a list of categorical variables


categorical_features = [
"Sex",
"Job",
"Housing",
"Saving accounts",
"Checking account",
"Purpose",
]

# creating a transformer for categorical variables, which will first apply the simple imputer
# and then do one hot encoding for the categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)

# handle_unknown = "ignore" allows the model to handle any unknown category in the test data

# combining categorical transformer and numerical transformer using a column transformer

preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
# remainder = "passthrough" allows variables that are present in the data
# but not in numerical_features or categorical_features to pass through the column transformer untouched

In [74]: # Separating target variable and other variables


X = data.drop("Risk", axis=1)
Y = data["Risk"]

Now that we already know the best model to proceed with, we don't need to
divide the data into 3 parts
In [75]: # Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)

(700, 9) (300, 9)

In [76]: # Creating new pipeline with best parameters


model = Pipeline(
steps=[
("pre", preprocessor),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.8,
learning_rate=0.01,
gamma=0,
eval_metric="logloss",
reg_lambda=5,
max_depth=1,
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
Out[76]:
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['Age', 'Credit amount', 'Duration']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Sex', 'Job', 'Housing',
                                                   'Saving accounts'...
                              gamma=0, gpu_id=-1, importance_type='gain',
                              interaction_constraints='', learning_rate=0.01,
                              max_delta_step=0, max_depth=1,
                              min_child_weight=1, missing=nan,
                              monotone_constraints='()', n_estimators=50,
                              n_jobs=8, num_parallel_tree=1, random_state=1,
                              reg_alpha=0, reg_lambda=5, scale_pos_weight=10,
                              subsample=0.8, tree_method='exact',
                              validate_parameters=1, verbosity=None))])
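The Conclusion below quotes a test recall of ~84%; a minimal sketch of how that check and the export of the fitted pipeline might look (joblib and the file name are assumptions, not part of the original notebook):

# Evaluating the fitted pipeline on the held-out test set with the helper defined earlier
xgb_pipeline_test = model_performance_classification_sklearn(model, X_test, y_test)
print("Test performance:")
print(xgb_pipeline_test)

# Persisting the fitted pipeline so it can be loaded in production (illustrative file name)
import joblib

joblib.dump(model, "german_credit_xgb_pipeline.joblib")
# later: loaded = joblib.load("german_credit_xgb_pipeline.joblib"); loaded.predict(new_data)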

Conclusion and Insights


The best test recall is ~84%, but the test precision is very low (~32%) at the same
time. This means that the model is not good at identifying non-defaulters, so
the bank could lose many opportunities to extend credit to non-defaulters.
The model performance can be improved, especially in terms of precision, and the
bank can use the model for new customers once the desired level of model
performance is achieved.
We saw in our analysis that customers with a little or moderate amount in their saving or
checking accounts are more likely to default. The bank can be stricter with its
rules or interest rates to compensate for the risk.
Customers with high credit amounts or who take credit for a longer duration are
more likely to default. The bank should be more careful while giving high credit
amounts or credit for a longer duration.
We saw that customers who have rented or free housing are more likely to default.
The bank should keep more details about such customers, like a hometown address,
to be able to track them.
Our analysis showed that younger customers are slightly more likely to default. The
bank can alter its policies to mitigate this risk.
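One way the precision-recall trade-off mentioned above could be explored is by raising the classification threshold from the default 0.5; a hedged sketch (the threshold value is illustrative and would need to be tuned on validation data):

# Sketch: trade some recall for precision by raising the decision threshold
from sklearn.metrics import precision_score, recall_score

proba = model.predict_proba(X_test)[:, 1]  # probability of default from the final pipeline
threshold = 0.7                            # illustrative value, not tuned here
pred_custom = (proba >= threshold).astype(int)

print("Precision:", precision_score(y_test, pred_custom))
print("Recall:", recall_score(y_test, pred_custom))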
