Problem Statement
Based on the given loan data, can we identify the major factors or characteristics of a borrower that lead them into the delinquent
stage?
• Delinquency is a major metric in assessing risk: as more customers become delinquent, the risk that customers will default
also increases.
• The main objective is to minimize this risk by building a decision tree model with the CART technique that identifies the risk
and non-risk attributes associated with borrowers entering the delinquent stage.
Importing libraries and Loading data
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
ld_df = pd.read_csv("Loan Delinquent Dataset.csv")
Checking the data
ld_df.head()
Dropping unwanted variables
Sdelinquent can also be dropped instead of delinquent.
ld_df=ld_df.drop(["ID","delinquent"],axis=1)
ld_df.head()
ld_df.shape
ld_df.info()
Many columns are of type object, i.e. strings. These need to be converted to ordinal type.
Getting unique counts of all object columns
print('term \n',ld_df.term.value_counts())
print('\n')
print('gender \n',ld_df.gender.value_counts())
print('\n')
print('purpose \n',ld_df.purpose.value_counts())
print('\n')
print('home_ownership \n',ld_df.home_ownership.value_counts())
print('\n')
print('age \n',ld_df.age.value_counts())
print('\n')
print('FICO \n',ld_df.FICO.value_counts())
Note:
The scikit-learn decision tree accepts only numeric input. It cannot take string / object columns directly.
The following code loops through each column and, if the column type is object, converts that column into a categorical and replaces each
distinct value with its integer category code.
for feature in ld_df.columns:
    if ld_df[feature].dtype == 'object':
        print('\n')
        print('feature:',feature)
        print(pd.Categorical(ld_df[feature].unique()))
        print(pd.Categorical(ld_df[feature].unique()).codes)
        ld_df[feature] = pd.Categorical(ld_df[feature]).codes
For each feature, compare the 2nd row (the unique values) with the 4th row (their codes) to get the encoding mappings. Ignore the line starting with 'Categories'.
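Reading the mapping off two printed rows is error-prone, so here is a minimal sketch of building it as a dictionary instead. The sample values and the names `term`, `cat`, and `mapping` are made up for illustration and are not taken from the loan dataset.

```python
import pandas as pd

# Hypothetical sample values standing in for one object column
term = pd.Series(["36 months", "60 months", "36 months", "60 months"])

cat = pd.Categorical(term)

# Categories are sorted, and integer code i refers to categories[i]
mapping = dict(enumerate(cat.categories))
print(mapping)

# The encoded column, as produced by the loop above
encoded = pd.Series(cat.codes)
print(encoded.tolist())
```

Storing a dictionary like this per column makes it possible to translate the integer codes in the fitted tree back to the original labels.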
Comparing the unique counts from above
print('term \n',ld_df.term.value_counts())
print('\n')
print('gender \n',ld_df.gender.value_counts())
print('\n')
print('purpose \n',ld_df.purpose.value_counts())
print('\n')
print('home_ownership \n',ld_df.home_ownership.value_counts())
print('\n')
print('age \n',ld_df.age.value_counts())
print('\n')
print('FICO \n',ld_df.FICO.value_counts())
ld_df.info()
ld_df.head()
Label encoding has been done and all columns have been converted to numbers.
Proportion of 1s and 0s
ld_df.Sdelinquent.value_counts(normalize=True)
print(ld_df.Sdelinquent.value_counts())
print('Proportion of 1s',7721/(7721+3827))
print('Proportion of 0s',3827/(7721+3827))
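The proportions can also be computed from the column itself rather than from hard-coded counts. The sketch below rebuilds a stand-in target with the class counts shown above (7721 ones and 3827 zeros). The name `target` is illustrative.

```python
import pandas as pd

# Stand-in target column with the class counts reported above
target = pd.Series([1] * 7721 + [0] * 3827, name="Sdelinquent")

# normalize=True returns class shares instead of raw counts
props = target.value_counts(normalize=True)
print(props.round(4))
```

Roughly two thirds of the rows belong to class 1, so the classes are imbalanced but not severely so.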
Separating the target column from the features before creating the training set and test set
X = ld_df.drop("Sdelinquent", axis=1)
y = ld_df.pop("Sdelinquent")
X.head()
Splitting data into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=.30, random_state=1)
Checking the dimensions of the training and test data
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)
print('Total Obs',8083+3465)
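The 8083 / 3465 row counts follow from the 30% test size applied to 11548 rows. This sketch reproduces the sizes on placeholder arrays (not the loan data). `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 11548  # total observations reported above

# Placeholder feature matrix and labels, used only to check the split sizes
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.zeros(n_rows, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1
)
print(X_tr.shape, X_te.shape)
```

Because 0.30 * 11548 is not a whole number, the test set size is rounded up and the training set receives the remaining rows.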
Building a Decision Tree Classifier
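# Initialise a Decision Tree Classifier (default Gini criterion; random_state=1 is assumed to match the rest of the notebook)
dt_model = DecisionTreeClassifier(criterion = 'gini', random_state=1)
# Fit the model
dt_model.fit(X_train, train_labels)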
from sklearn import tree
train_char_label = ['No', 'Yes']
ld_Tree_File = open('ld_Tree_File.dot','w')
dot_data = tree.export_graphviz(dt_model,
out_file=ld_Tree_File,
feature_names = list(X_train),
class_names = list(train_char_label))
ld_Tree_File.close()
The above code will save a .dot file in your working directory.
WebGraphviz is Graphviz in the Browser.
Copy and paste the contents of the file into the link below to get the visualization:
http://webgraphviz.com/
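As an alternative to rendering the .dot file in a browser, scikit-learn can print a tree as indented text rules with `export_text`. This is a different technique from the Graphviz export above, shown here on synthetic data. The feature names `f0` to `f3` and the shallow depth are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are hypothetical
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=1)
names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
rules = export_text(demo_tree, feature_names=names)
print(rules)
```

The text form is convenient for a quick check of the top splits. The Graphviz output remains the better choice for large trees.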
Variable Importance
print (pd.DataFrame(dt_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values('Imp',ascending=False))
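The importance values are normalized, so they can be read as shares of the total impurity reduction achieved by the tree. A minimal sketch on synthetic data (the column names and the `imp` variable are illustrative) shows that they are non-negative and sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=500, n_features=4, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

demo_tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Importance per feature, largest first
imp = pd.Series(
    demo_tree.feature_importances_, index=X_demo.columns, name="Imp"
).sort_values(ascending=False)
print(imp)
print("Sum of importances:", imp.sum())
```

A feature with importance 0 was never used for a split in the fitted tree.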
Predicting Test Data
y_predict = dt_model.predict(X_test)
y_predict.shape
Regularising the Decision Tree
Adding Tuning Parameters
reg_dt_model = DecisionTreeClassifier(criterion = 'gini', max_depth = 30,min_samples_leaf=100,min_samples_split=1000, random_state=1)
reg_dt_model.fit(X_train, train_labels)
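To see what these tuning parameters do, the sketch below fits an unrestricted tree and a tree with the same settings as above on synthetic data, then compares their sizes. The data and the names `full_tree` and `reg_tree` are assumptions for illustration, so the numbers will differ from those for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so that an unrestricted tree grows large
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=6, flip_y=0.1, random_state=1
)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
reg_tree = DecisionTreeClassifier(
    criterion="gini", max_depth=30, min_samples_leaf=100,
    min_samples_split=1000, random_state=1,
).fit(X_demo, y_demo)

# Leaves are the nodes without a left child (marked with -1)
is_leaf = reg_tree.tree_.children_left == -1
min_leaf = reg_tree.tree_.n_node_samples[is_leaf].min()

print("Leaves (unrestricted):", full_tree.get_n_leaves())
print("Leaves (regularised):", reg_tree.get_n_leaves())
print("Smallest regularised leaf:", min_leaf)
```

`min_samples_split` stops a node with fewer than 1000 rows from being split, and `min_samples_leaf` rejects any split that would leave fewer than 100 rows on either side. With limits this strict, `max_depth=30` is unlikely to be the binding constraint.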
Generating New Tree
ld_tree_regularized = open('ld_tree_regularized.dot','w')
dot_data = tree.export_graphviz(reg_dt_model, out_file= ld_tree_regularized , feature_names = list(X_train), class_names = list(train_char_label))
ld_tree_regularized.close()
dot_data
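Variable Importance
print (pd.DataFrame(reg_dt_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values('Imp',ascending=False))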
Predicting on Training and Test dataset
# Predict the classes with the regularised tree
ytrain_predict = reg_dt_model.predict(X_train)
ytest_predict = reg_dt_model.predict(X_test)
print('ytrain_predict',ytrain_predict.shape)
print('ytest_predict',ytest_predict.shape)
Getting the Predicted Classes
ytest_predict
Getting the Predicted Probabilities
ytest_predict_prob=reg_dt_model.predict_proba(X_test)
ytest_predict_prob
pd.DataFrame(ytest_predict_prob).head()
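The columns of `predict_proba` follow the order of the model's `classes_` attribute, and the predicted class is the one with the highest probability. The sketch below, on synthetic data with illustrative names, checks that a 0.5 cut-off on the positive column reproduces `predict`, and shows how a lower cut-off flags more rows as positive. The 0.3 cut-off is an arbitrary value chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with labels 0 and 1
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, flip_y=0.1, random_state=1
)
demo_tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=1).fit(X_demo, y_demo)

proba = demo_tree.predict_proba(X_demo)

# Locate the column that holds the probability of class 1
pos_col = list(demo_tree.classes_).index(1)
p_pos = proba[:, pos_col]

# Strictly above 0.5 matches the default predicted class
default_pred = (p_pos > 0.5).astype(int)

# A lower, hypothetical cut-off flags more rows as class 1
cautious_pred = (p_pos > 0.3).astype(int)
print(default_pred.sum(), cautious_pred.sum())
```

Lowering the cut-off trades precision for recall, which can be useful when missing a positive case is the costlier error.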
Model Evaluation
Measuring AUC-ROC Curve
import matplotlib.pyplot as plt
AUC and ROC for the training data
# predict probabilities
probs = reg_dt_model.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(train_labels, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(train_labels, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()
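The AUC printed above is the area under the plotted ROC curve. As a sanity check, this sketch on synthetic data (names are illustrative) computes the area from the curve points with `sklearn.metrics.auc` and compares it with `roc_auc_score`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, flip_y=0.1, random_state=1
)
demo_tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=1).fit(X_demo, y_demo)

# Probability of the positive class
scores = demo_tree.predict_proba(X_demo)[:, 1]

# Points of the ROC curve, then the trapezoidal area under them
fpr, tpr, thresholds = roc_curve(y_demo, scores)
area_from_curve = auc(fpr, tpr)

# The same quantity computed directly from labels and scores
area_direct = roc_auc_score(y_demo, scores)
print(round(area_from_curve, 3), round(area_direct, 3))
```

A value of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and 1.0 to a perfect ranking.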
AUC and ROC for the test data
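# predict probabilities
probs = reg_dt_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
print('AUC: %.3f' % roc_auc_score(test_labels, probs))
# calculate roc curve
fpr, tpr, thresholds = roc_curve(test_labels, probs)
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()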
Confusion Matrix for the training data
from sklearn.metrics import classification_report,confusion_matrix
confusion_matrix(train_labels, ytrain_predict)
#Train Data Accuracy
reg_dt_model.score(X_train,train_labels)
print((1985+4742)/(1985+650+706+4742))
print(classification_report(train_labels, ytrain_predict))
Confusion Matrix for test data
confusion_matrix(test_labels, ytest_predict)
#Test Data Accuracy
reg_dt_model.score(X_test,test_labels)
print((922+1941)/(922+270+332+1941))
print(classification_report(test_labels, ytest_predict))
Conclusion
Accuracy on the Training Data: 83.2%
Accuracy on the Test Data: 82.6%
AUC on the Training Data: 87.9%
AUC on the Test Data: 88.1%
Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This indicates that no significant overfitting or
underfitting has occurred, and overall the model is a good model for classification.
Recall is the more important metric here because we do not want to miss customers who are likely to become delinquent.
The ability to catch delinquencies would help banks be more proactive in their approach. From the confusion matrix of the
test data we can see that the model misclassified 332 customers (false negatives) as non-delinquent when they are in fact delinquent.
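The test metrics can be recomputed from the four confusion matrix counts used in the accuracy calculation above. The sketch assumes the scikit-learn layout (rows are actual classes 0 and 1, columns are predicted classes 0 and 1), which is consistent with the 332 false negatives mentioned here.

```python
import numpy as np

# Test-set counts from the accuracy arithmetic above.
# Assumed layout: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[922, 270],
               [332, 1941]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
recall = tp / (tp + fn)       # share of delinquent customers that were caught
precision = tp / (tp + fp)    # share of flagged customers that were delinquent

print("Accuracy:", round(accuracy, 3))
print("Recall:", round(recall, 3))
print("Precision:", round(precision, 3))
```

Under this layout the model catches roughly 85% of the delinquent customers in the test set.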
FICO, term and gender (in that order of importance) are the most important variables in determining whether a borrower will enter the delinquent
stage.