Germany Credit Analysis
Context
When a bank receives a loan application, it has to decide, based on the applicant’s
profile, whether to approve the loan. Two types of risk are associated with the
bank’s decision:
If the applicant is a good credit risk, i.e. is likely to repay the loan, then not
approving the loan to the person results in a loss of business to the bank
If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving
the loan to the person results in a financial loss to the bank
To minimize this loss, HRE bank wants to automate the process with a predictive model
that predicts whether a customer is at risk of default, based on the customer’s
demographic and socio-economic profile.
You, as a data scientist at HRE bank, have been assigned the task of building this
predictive model.
Objective
The objective is to build a model to predict whether a person would default or not. In this
dataset, the target variable is 'Risk'.
Dataset Description
Age (Numeric: Age in years)
Sex (Categories: male, female)
Job (Categories : 0 - unskilled and non-resident, 1 - unskilled and resident, 2 -
skilled, 3 - highly skilled)
Housing (Categories: own, rent, or free)
Saving accounts (Categories: little, moderate, quite rich, rich)
Checking account (Categories: little, moderate, rich)
Credit amount (Numeric: Amount of credit in DM - Deutsche Mark)
Duration (Numeric: Duration for which the credit is given in months)
Purpose (Categories: car, furniture/equipment, radio/TV, domestic appliances,
repairs, education, business, vacation/others)
Risk (0 - Person is not at risk, 1 - Person is at risk (defaulter))
Importing libraries
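In [2]: # To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization (plt and sns are used by the plotting cells below)
import matplotlib.pyplot as plt
import seaborn as sns
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")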
# This will help in making the Python code more structured automatically (go
%load_ext nb_black
Loading Data
In [3]: # Loading the dataset
german = pd.read_csv("German_Credit.csv")
In [4]: # Checking the number of rows and columns in the data
german.shape
Out[4]: (1000, 10)
Data Overview
In [5]: data = german.copy()
Out[6]:
   Age     Sex  Job Housing Saving accounts Checking account  Credit amount  Duration              Purpose
0   67    male    2     own             NaN           little           1169         6             radio/TV
1   22  female    2     own          little         moderate           5951        48             radio/TV
2   49    male    1     own          little              NaN           2096        12            education
3   45    male    2    free          little           little           7882        42  furniture/equipment
4   53    male    2    free          little           little           4870        24                  car
In [8]: # let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1000 non-null int64
1 Sex 1000 non-null object
2 Job 1000 non-null int64
3 Housing 1000 non-null object
4 Saving accounts 817 non-null object
5 Checking account 606 non-null object
6 Credit amount 1000 non-null int64
7 Duration 1000 non-null int64
8 Purpose 1000 non-null object
9 Risk 1000 non-null int64
dtypes: int64(5), object(5)
memory usage: 78.2+ KB
Out[9]: 0
Out[10]:
Age                  0.000
Sex                  0.000
Job                  0.000
Housing              0.000
Saving accounts     18.300
Checking account    39.400
Credit amount        0.000
Duration             0.000
Purpose              0.000
Risk                 0.000
dtype: float64
The Saving accounts column has 18.3% missing values.
The Checking account column has 39.4% missing values.
We will impute these values after splitting the data into train, validation, and test sets.
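The percentages above follow from the share of NaN entries in each column. The cell that produced them is not shown, so the following is only a minimal sketch of one way to compute them, run on a small made-up frame (the toy values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit data (values are made up)
toy = pd.DataFrame(
    {
        "Saving accounts": ["little", np.nan, "rich", "little"],
        "Checking account": [np.nan, np.nan, "moderate", "little"],
        "Age": [67, 22, 49, 45],
    }
)

# isna() marks missing cells; the column mean of that boolean mask is the
# fraction of missing values, which is scaled to a percentage here
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```

Applied to the real frame, the same expression would be `data.isna().mean() * 100`.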
In [11]: # Checking for the null value in the dataset
data.isna().sum()
Out[11]:
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64
Out[12]:
Age                  53
Sex                   2
Job                   4
Housing               3
Saving accounts       4
Checking account      3
Credit amount       921
Duration             33
Purpose               8
Risk                  2
dtype: int64
Age has only 53 unique values, so many customers share the same age.
We have only three continuous variables - Age, Credit Amount and Duration.
All other variables are categorical
In [13]: # let's view the statistical summary of the numerical columns in the data
data.describe().T
The mean of the Age column is approx 35 and the median is 33, so at least half of
the customers are 33 or younger.
Mean amount of credit is approx 3,271 but it has a wide range of 250 to 18,424. We
will explore this further in univariate analysis.
Mean duration for which the credit is given is approx 21 months.
Checking the value count for each category of categorical variables
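In [14]: # Making a list of all categorical variables
cat_col = [
    "Sex",
    "Job",
    "Housing",
    "Saving accounts",
    "Checking account",
    "Purpose",
    "Risk",
]
# Printing the count of each level in each column
# (loop reconstructed from the output that follows; the original lines are missing)
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 40)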
male 690
female 310
Name: Sex, dtype: int64
----------------------------------------
2 630
1 200
3 148
0 22
Name: Job, dtype: int64
----------------------------------------
own 713
rent 179
free 108
Name: Housing, dtype: int64
----------------------------------------
little 603
moderate 103
quite rich 63
rich 48
Name: Saving accounts, dtype: int64
----------------------------------------
little 274
moderate 269
rich 63
Name: Checking account, dtype: int64
----------------------------------------
car 337
radio/TV 280
furniture/equipment 181
business 97
education 59
repairs 22
vacation/others 12
domestic appliances 12
Name: Purpose, dtype: int64
----------------------------------------
0 700
1 300
Name: Risk, dtype: int64
----------------------------------------
Univariate analysis
In [15]: # function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; showmeans adds a marker at the mean value
    # histogram; bins is passed only when the caller specifies it
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
Observation on Age
In [16]: # Observations on Age
histogram_boxplot(data, "Age")
The distribution of age is right-skewed
The boxplot shows that there are outliers at the right end
We will not treat these outliers as they represent the real market trend
Observation on Credit Amount
In [17]: histogram_boxplot(data, "Credit amount")
The distribution of the duration for which the credit is given is right-skewed
The boxplot shows that there are outliers at the right end
We will not treat these outliers as they represent the real market trend
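In [19]: # function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with a count or percentage label on each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all)
    """
    # total, x and y are used below but their definitions are missing from the
    # source; the definitions given here are assumed reconstructions
    total = len(data[feature])  # number of rows, used for the percentages
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar (assumed)
        y = p.get_height()  # top of the bar (assumed)
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its label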
Observations on Risk
In [20]: # observations on Risk
labeled_barplot(data, "Risk")
The majority of customers who take credit, approx 71%, own their house.
Approx 18% of customers live in a rented house.
Only 11% of customers have free housing. These are the customers
who live in a house provided by their company or organization.
Observations on Job
In [23]: # observations on Job
labeled_barplot(data, "Job")
The majority of the customers, i.e. 63%, fall into the skilled category.
Only approx 15% of customers are in the highly skilled category, which
makes sense as these may be people with higher education or extensive
experience.
Approx 22% of customers fall into job categories 0 or 1 (unskilled).
Observations on Saving accounts
In [24]: # observations on Saving accounts
labeled_barplot(data, "Saving accounts")
Approx 70% of customers who take credit have a little or moderate amount in their
savings account. This makes sense as these customers would need credit more
than the other categories.
Approx 11% of customers who take credit are in the quite rich or rich categories
based on their balance in the savings account.
Note that the percentages do not add up to 100 as we have missing values in this
column.
Observations on Checking account
In [25]: # observations on Checking account
labeled_barplot(data, "Checking account")
Approx 54% of customers who take credit have a little or moderate amount in their
checking account. This makes sense as these customers would need credit more
than the other categories.
Approx 6% of customers who take credit are in the rich category based on their
balance in checking account.
Note that the percentages do not add up to 100 as we have missing values in this
column.
Observations on Purpose
In [26]: # observations on Purpose
labeled_barplot(data, "Purpose")
The plot shows that most customers take credit for luxury items like cars, radio/TV,
furniture/equipment, and domestic appliances.
Only about 16% of customers take credit for business or education.
Bivariate Analysis
In [27]: sns.pairplot(data, hue="Risk")
Out[27]: <seaborn.axisgrid.PairGrid at 0x267080d8ac0>
The distributions overlap, i.e. there is no clear distinction in the distribution of
variables between customers who defaulted and those who did not.
Let's explore this further with the help of other plots.
In [28]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Age", data=data, orient="vertical")
Out[28]: <matplotlib.axes._subplots.AxesSubplot at 0x267075b2d00>
We can see that the median age of defaulters is less than the median age of non-
defaulters.
This shows that younger customers are more likely to default.
There are outliers in boxplots of both class distributions
In [29]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Credit amount", data=data, orient="vertical")
Out[29]: <matplotlib.axes._subplots.AxesSubplot at 0x267080cfa60>
We can see that the third quartile amount of defaulters is much more than the third
quartile amount of non-defaulters.
This shows that customers with high credit amounts are more likely to default.
There are outliers in boxplots of both class distributions
In [30]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Duration", data=data, orient="vertical")
Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0x2670769e400>
We can see that the median and third quartile duration of defaulters are much higher
than the median and third quartile duration of non-defaulters.
This shows that customers with longer credit durations are more likely to default.
In [31]: sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Saving accounts", y="Age", data=data)
Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x2670b4509d0>
The plot shows that customers with higher age are in the rich or quite rich category.
Age of the customers in the little and moderate category is slightly less but there
are outliers in both of the distributions.
In [32]: # function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left",
        frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
Risk 0 1 All
Sex
All 700 300 1000
male 499 191 690
female 201 109 310
---------------------------------------------------------------------------
---------------------------------------------
We saw earlier that there are more male customers than female
customers. This plot shows that female customers are more likely to default
than male customers.
In [34]: stacked_barplot(data, "Job", "Risk")
Risk 0 1 All
Job
All 700 300 1000
2 444 186 630
1 144 56 200
3 97 51 148
0 15 7 22
---------------------------------------------------------------------------
---------------------------------------------
Risk 0 1 All
Housing
All 700 300 1000
own 527 186 713
rent 109 70 179
free 64 44 108
---------------------------------------------------------------------------
---------------------------------------------
Risk 0 1 All
Saving accounts
All 549 268 817
little 386 217 603
moderate 69 34 103
quite rich 52 11 63
rich 42 6 48
---------------------------------------------------------------------------
---------------------------------------------
As we saw earlier, customers with a little or moderate amount in saving accounts
take more credit but at the same time, they are most likely to default.
Rich customers are slightly less likely to default as compared to quite rich
customers
In [37]: stacked_barplot(data, "Checking account", "Risk")
Risk 0 1 All
Checking account
All 352 254 606
little 139 135 274
moderate 164 105 269
rich 49 14 63
---------------------------------------------------------------------------
---------------------------------------------
The plot further confirms the findings of the plot above.
Customers with a little amount in checking accounts are the most likely to default,
followed by customers with a moderate amount, who in turn are more likely to
default than the rich customers.
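The ordering described above can be checked directly from the counts in the crosstab. A small worked calculation, with the counts copied by hand from the table (the variable names are illustrative):

```python
# Counts copied from the crosstab above: (non-defaulters, defaulters)
checking_counts = {
    "little": (139, 135),
    "moderate": (164, 105),
    "rich": (49, 14),
}

# Default rate per level = defaulters / all customers in that level
default_rate = {
    level: bad / (good + bad) for level, (good, bad) in checking_counts.items()
}

for level, rate in default_rate.items():
    print(f"{level}: {rate:.1%}")
# little: 49.3%
# moderate: 39.0%
# rich: 22.2%
```

These rates cover only the 606 customers with a recorded checking account level; the 394 rows with missing values are excluded by the crosstab.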
In [38]: stacked_barplot(data, "Purpose", "Risk")
Risk 0 1 All
Purpose
All 700 300 1000
car 231 106 337
radio/TV 218 62 280
furniture/equipment 123 58 181
business 63 34 97
education 36 23 59
repairs 14 8 22
vacation/others 7 5 12
domestic appliances 8 4 12
---------------------------------------------------------------------------
---------------------------------------------
Customers who take credit for radio/TV are least likely to default. This might be
because their credit amount is small.
Customers who take credit for education or vacation are most likely to default.
Other categories have no significant difference between their default and non-
default ratio.
In [39]: plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spect
plt.show()
Credit amount and duration have a positive correlation which makes sense as
customers might take the credit for a longer duration if the amount of credit is high
Other variables have no significant correlation between them
Data Preparation for Modeling
Split data
In [40]: df = data.copy()
Missing-Value Treatment
We will use mode to impute missing values in Saving accounts and Checking
account column.
In [43]: # Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Saving accounts", "Checking account"]
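The cells that perform the split and apply the imputer are not shown. The sketch below illustrates the usual pattern, assuming the imputer is fitted on the training split only and then reused on the other splits so that no information leaks from validation data. The toy frame, the split sizes, and the `X_val` name are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data standing in for the credit dataset (values are made up)
toy = pd.DataFrame(
    {
        "Saving accounts": ["little", np.nan, "rich", "little",
                            "little", np.nan, "moderate", "little"],
        "Checking account": [np.nan, "moderate", "moderate", np.nan,
                             "little", "moderate", "rich", "moderate"],
        "Risk": [0, 1, 0, 0, 1, 0, 0, 1],
    }
)
X = toy.drop(columns="Risk")
y = toy["Risk"]

# Split first, so the validation rows play no part in choosing the fill value
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)
X_train, X_val = X_train.copy(), X_val.copy()

imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Saving accounts", "Checking account"]

# fit_transform learns the modes from the training split;
# transform reuses those same modes on the validation split
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])

print(imp_mode.statistics_)
```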
Cross-Validation Performance:
Bagging: 24.444444444444446
Random forest: 24.444444444444446
GBM: 25.0
Adaboost: 25.0
Xgboost: 35.0
dtree: 43.33333333333333
Validation Performance:
Bagging: 0.2833333333333333
Random forest: 0.31666666666666665
GBM: 0.31666666666666665
Adaboost: 0.26666666666666666
Xgboost: 0.36666666666666664
dtree: 0.31666666666666665
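# the figure must exist before suptitle is called (figure creation restored here)
fig = plt.figure()
fig.suptitle("Algorithm Comparison")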
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
We can see that the decision tree is giving the highest cross-validated recall
followed by xgboost
The boxplot shows that the performance of decision tree and xgboost is consistent
and their performance on the validation set is also good
We will tune the best two models i.e. decision tree and xgboost and see if the
performance improves
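The cell that produced the cross-validated scores is not shown. The following is a minimal sketch of how a cross-validated recall can be computed, assuming stratified 5-fold splitting and synthetic data in place of the credit data. The cross-validation figures above appear to be expressed in percent while the validation figures are fractions, so the sketch prints the mean in percent as well.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 30% positives) standing in for the credit data
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=1
)

# Stratified folds keep the class ratio similar in every fold
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = DecisionTreeClassifier(random_state=1)

# One recall score per fold
cv_recall = cross_val_score(model, X, y, scoring="recall", cv=kfold)

print("Recall per fold:", cv_recall.round(3))
print("Mean recall (%):", round(cv_recall.mean() * 100, 2))
```

The per-fold scores are what a boxplot such as the one above would summarize: a narrow box indicates consistent performance across folds.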
Hyperparameter Tuning
We will tune decision tree and xgboost models using GridSearchCV and
RandomizedSearchCV. We will also compare the performance and time taken by
these two methods - grid search and randomized search.
First let's create two functions to calculate different metrics and confusion matrix,
so that we don't have to use the same code repeatedly for each model.
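In [48]: # function to compute the metrics used to check classifier performance
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    # the lines building df_perf are missing from the source; this reconstruction
    # is assumed from the Accuracy/Recall/Precision/F1 columns shown in the outputs
    df_perf = pd.DataFrame(
        {"Accuracy": accuracy_score(target, pred), "Recall": recall_score(target, pred),
         "Precision": precision_score(target, pred), "F1": f1_score(target, pred)},
        index=[0],
    )
    return df_perf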
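# function to plot the confusion matrix (the def line is missing from the source;
# the name confusion_matrix_sklearn is an assumption)
from sklearn.metrics import confusion_matrix
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    # each cell shows the count and its share of all observations
    # (the divisor is truncated in the source; the total count is assumed)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")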
Decision Tree
GridSearchCV
In [50]: # Defining the model
model = DecisionTreeClassifier(random_state=1)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=score
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.be
)
Out[51]: DecisionTreeClassifier(random_state=1)
Training performance:
Out[52]: Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
Validation performance:
Out[53]: Accuracy Recall Precision F1
0 0.595 0.317 0.322 0.319
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=param_grid,
n_iter=20,
scoring=scorer,
cv=5,
random_state=1,
)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Out[56]: DecisionTreeClassifier(criterion='entropy', random_state=1)
Training performance:
Out[57]: Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
Validation performance:
Out[58]: Accuracy Recall Precision F1
0 0.575 0.450 0.342 0.388
We reduced the number of iterations to only 20, but two out of the three parameters
are the same as what we got from the grid search.
The validation recall has increased by ~14% as compared to cross-validated recall.
Validation recall (0.450) is higher and validation accuracy (0.575) is slightly lower
than for the decision tree tuned with GridSearchCV (0.317 and 0.595).
The model is overfitting the training data: every training metric is 1.000.
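The parameter grid and scorer used above are not shown. The sketch below compares the two search strategies on synthetic data, assuming a hypothetical `param_grid` and plain recall as the scoring metric. It shows only the mechanics: grid search evaluates every combination, while randomized search evaluates `n_iter` sampled combinations from the same grid.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the credit data
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=1
)

# Hypothetical grid (not the one used above): 4 * 3 * 2 = 24 combinations
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
model = DecisionTreeClassifier(random_state=1)

# Exhaustive search over all 24 combinations
start = time.perf_counter()
grid_cv = GridSearchCV(model, param_grid=param_grid, scoring="recall", cv=5)
grid_cv.fit(X, y)
grid_time = time.perf_counter() - start

# Randomized search over 10 combinations sampled from the same grid
start = time.perf_counter()
randomized_cv = RandomizedSearchCV(
    model,
    param_distributions=param_grid,
    n_iter=10,
    scoring="recall",
    cv=5,
    random_state=1,
)
randomized_cv.fit(X, y)
random_time = time.perf_counter() - start

print("Grid search:", grid_cv.best_params_, round(grid_cv.best_score_, 3),
      f"{grid_time:.2f}s")
print("Randomized search:", randomized_cv.best_params_,
      round(randomized_cv.best_score_, 3), f"{random_time:.2f}s")
```

Because the randomized search draws from the same grid and uses the same folds, its best CV score cannot exceed the grid search result; the trade-off is fewer model fits.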
XGBoost
GridSearchCV
In [60]: %%time
#defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=score
Training performance:
Out[62]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462
Validation performance:
Out[63]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462
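The validation recall has increased by ~65 percentage points as compared to the
result from cross-validation with default parameters.
Training and validation metrics are identical, so the model is not overfitting.
The model identifies all the defaulters, but an accuracy and precision of 0.300 equal
the share of defaulters in the data (300 of 1,000), which indicates that it flags
every customer as a defaulter.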
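A quick arithmetic check, offered as one possible reading of the metrics above rather than a statement about the fitted model: a classifier that labels every customer as a defaulter, on data with 30% defaulters, produces exactly this accuracy, recall, precision, and F1. The labels below are synthetic.

```python
# 1000 customers with 30% defaulters, matching the class balance of the dataset
y_true = [1] * 300 + [0] * 700
y_pred = [1] * 1000  # a model that flags everyone as a defaulter

# Confusion-matrix cells counted by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.3f} {recall:.3f} {precision:.3f} {f1:.3f}")
# 0.300 1.000 0.300 0.462
```

A recall of 1.000 on its own therefore says little; precision and accuracy should be read alongside it when comparing the tuned models.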
RandomizedSearchCV
In [65]: %%time
# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
#Calling RandomizedSearchCV
xgb_tuned2 = RandomizedSearchCV(estimator=model, param_distributions=param_g
Training performance:
Out[67]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462
Validation performance:
Out[68]: Accuracy Recall Precision F1
0 0.300 1.000 0.300 0.462
We reduced the number of iterations to only 20 but the model performance is very
similar to the results for the xgboost model tuned with GridSearchCV
Comparing models from GridSearchCV and RandomizedSearchCV
In [70]: # training performance comparison
models_train_comp_df = pd.concat(
[
dtree_grid_train.T,
dtree_random_train.T,
xgboost_grid_train.T,
xgboost_random_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree Tuned with Grid search",
"Decision Tree Tuned with Random search",
"Xgboost Tuned with Grid search",
"Xgboost Tuned with Random Search",
]
print("Training performance comparison:")
models_train_comp_df
models_val_comp_df = pd.concat(
[
dtree_grid_val.T,
dtree_random_val.T,
xgboost_grid_val.T,
xgboost_random_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Decision Tree Tuned with Grid search",
"Decision Tree Tuned with Random search",
"Xgboost Tuned with Grid search",
"Xgboost Tuned with Random Search",
]
print("Validation performance comparison:")
models_val_comp_df
We can see that XGBoost is giving a similar performance with GridSearchCV and
RandomizedSearchCV with a validation recall of ~1.00
Let's see the feature importance from the xgboost model tuned with GridSearchCV.
In [72]: feature_names = X_train.columns
importances = xgb_tuned1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="c
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Saving accounts and Duration are the two most important variables, which makes
sense as these variables play an important role in taking and repaying credit.
Column Transformer
We know that we can use pipelines to standardize model building, but the steps
in a pipeline are applied to every variable. How can we customize the
pipeline to perform different processing on different columns?
A column transformer allows different columns or column subsets of the input to be
transformed separately; the features generated by each transformer are
concatenated to form a single feature space. This is useful for heterogeneous or
columnar data, to combine several feature extraction mechanisms or
transformations into a single transformer.
We will create 2 different pipelines, one for numerical columns and one for
categorical columns.
For numerical columns, we will do missing value imputation as pre-processing.
For categorical columns, we will do one hot encoding and missing value imputation
as pre-processing.
We are doing missing value imputation for all columns so that any missing value
that appears in future data is handled.
In [73]: # creating a list of numerical variables
numerical_features = ["Age", "Credit amount", "Duration"]
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
# remainder = "passthrough" has been used, it will allow variables that are
# but not in "numerical_columns" and "categorical_columns" to pass through t
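The definitions of `numeric_transformer`, `categorical_transformer`, and `categorical_features` are not shown in the cell above. The sketch below is one plausible way to assemble the two pipelines, assuming median imputation for numerical columns, mode imputation followed by one hot encoding for categorical columns, and a shortened, illustrative `categorical_features` list. The toy rows are adapted from the first rows of the data shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numerical_features = ["Age", "Credit amount", "Duration"]
# Assumed list; the source does not show how categorical_features is defined
categorical_features = ["Sex", "Housing", "Saving accounts"]

# Numerical columns: imputation only (median is an assumption)
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

# Categorical columns: impute with the mode, then one hot encode
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",  # columns in neither list (here: Job) pass through
)

toy = pd.DataFrame(
    {
        "Age": [67, 22, 49, 45],
        "Sex": ["male", "female", "male", "male"],
        "Job": [2, 2, 1, 2],
        "Housing": ["own", "own", "own", "free"],
        "Saving accounts": [np.nan, "little", "little", "rich"],
        "Credit amount": [1169, 5951, 2096, 7882],
        "Duration": [6, 48, 12, 42],
    }
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # 3 numeric + 6 one hot + 1 passthrough columns
```

The preprocessor can then be placed as the first step of a `Pipeline` whose last step is the chosen classifier, so that the same preprocessing is applied at fit and predict time.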
We already know the best model to proceed with, so we don't need to
divide the data into 3 parts.
In [75]: # Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(700, 9) (300, 9)