LoanTap Case Study
Introduction
LoanTap is at the forefront of offering tailored financial solutions to millennials. Their innovative approach
seeks to harness data science for refining their credit underwriting process. The focus here is the Personal
Loan segment. A deep dive into the dataset can reveal patterns in borrower behaviour and creditworthiness.
Analyzing this dataset can provide crucial insights into the financial behaviours, spending habits and
potential risk associated with each borrower. The insights gained can optimize loan disbursal, balancing
customer outreach with risk management.
What is expected
Assuming you are a data scientist at LoanTap, you are tasked with analyzing the dataset to determine the
creditworthiness of potential borrowers. Your ultimate objective is to build a logistic regression model,
evaluate its performance, and provide actionable insights for the underwriting process.
1. Data
The analysis was done on the data located at -
https://drive.google.com/file/d/1ZPYj7CZCfxntE8p2Lze_4QO4MyEOy6_d/view?usp=sharing
2. Libraries
Below are the libraries required
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
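Later cells also rely on pandas, numpy, matplotlib, seaborn, scikit-learn and imbalanced-learn. A sketch of the presumably required imports, inferred from their usage below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, precision_recall_curve, classification_report
from imblearn.over_sampling import SMOTE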
3. Data Loading
Loading the data into a Pandas DataFrame for easier handling.
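The read call itself is not shown in this extract; it was presumably along the lines of the sketch below (the filename is hypothetical), followed by the summary calls whose outputs appear next:
df = pd.read_csv('loan_data.csv')  # hypothetical filename for the file downloaded above
print(df.info())
print(f'Shape of the dataset is {df.shape}')
print('Number of nan/null values in each column:')
print(df.isna().sum())
print('Number of unique values in each column:')
print(df.nunique())
print('Duplicate entries:')
print(df.duplicated().value_counts())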
*************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 396030 non-null float64
1 term 396030 non-null object
2 int_rate 396030 non-null float64
3 installment 396030 non-null float64
4 grade 396030 non-null object
5 sub_grade 396030 non-null object
6 emp_title 373103 non-null object
7 emp_length 377729 non-null object
8 home_ownership 396030 non-null object
9 annual_inc 396030 non-null float64
10 verification_status 396030 non-null object
11 issue_d 396030 non-null object
12 loan_status 396030 non-null object
13 purpose 396030 non-null object
14 title 394274 non-null object
15 dti 396030 non-null float64
16 earliest_cr_line 396030 non-null object
17 open_acc 396030 non-null float64
18 pub_rec 396030 non-null float64
19 revol_bal 396030 non-null float64
20 revol_util 395754 non-null float64
21 total_acc 396030 non-null float64
22 initial_list_status 396030 non-null object
23 application_type 396030 non-null object
24 mort_acc 358235 non-null float64
25 pub_rec_bankruptcies 395495 non-null float64
26 address 396030 non-null object
dtypes: float64(12), object(15)
memory usage: 81.6+ MB
None
*************************************************
*************************************************
Shape of the dataset is (396030, 27)
*************************************************
*************************************************
Number of nan/null values in each column:
loan_amnt 0
term 0
int_rate 0
installment 0
grade 0
sub_grade 0
emp_title 22927
emp_length 18301
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
purpose 0
title 1756
dti 0
earliest_cr_line 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 276
total_acc 0
initial_list_status 0
application_type 0
mort_acc 37795
pub_rec_bankruptcies 535
address 0
dtype: int64
*************************************************
*************************************************
Number of unique values in each column:
loan_amnt 1397
term 2
int_rate 566
installment 55706
grade 7
sub_grade 35
emp_title 173105
emp_length 11
home_ownership 6
annual_inc 27197
verification_status 3
issue_d 115
loan_status 2
purpose 14
title 48816
dti 4262
earliest_cr_line 684
open_acc 61
pub_rec 20
revol_bal 55622
revol_util 1226
total_acc 118
initial_list_status 2
application_type 3
mort_acc 33
pub_rec_bankruptcies 9
address 393700
dtype: int64
*************************************************
*************************************************
Duplicate entries:
False 396030
Name: count, dtype: int64
*************************************************
Out[3]:  loan_amnt  term       int_rate  installment  grade  sub_grade  emp_title                emp_length  home_ownership  annu...
0        10000.0    36 months  11.44     329.48       B      B4         Marketing                10+ years   RENT            117...
1        8000.0     36 months  11.99     265.68       B      B5         Credit analyst           4 years     MORTGAGE        65...
2        15600.0    36 months  10.49     506.97       B      B3         Statistician             < 1 year    RENT            43...
3        7200.0     36 months  6.49      220.65       A      A2         Client Advocate          6 years     RENT            54...
4        24375.0    60 months  17.27     609.33       C      C5         Destiny Management Inc.  9 years     MORTGAGE        55...
5 rows × 27 columns
In [4]: df[df.columns[10:20]]
        verification_status  issue_d   loan_status  purpose             title               dti    earliest_cr_line  open_acc  ...
0       Not Verified         Jan-2015  Fully Paid   vacation            Vacation            26.24  Jun-1990          16.0
1       Not Verified         Jan-2015  Fully Paid   debt_consolidation  Debt consolidation  22.05  Jul-2004          17.0
...
396025  Source Verified      Oct-2015  Fully Paid   debt_consolidation  Debt consolidation  15.63  Nov-2004          6.0
396026  Source Verified      Feb-2015  Fully Paid   debt_consolidation  Debt consolidation  21.45  Feb-2006          6.0
396028  Verified             Aug-2012  Fully Paid   debt_consolidation  Loanforpayoff       15.88  Nov-1990          9.0
In [5]: df[df.columns[20:]]
        revol_util  total_acc  initial_list_status  application_type  mort_acc  pub_rec_bankruptcies  address
0       41.8        25.0       w                    INDIVIDUAL        0.0       0.0                   ...Gateway\r\nMendo...
1       53.3        27.0       f                    INDIVIDUAL        3.0       0.0                   ...347\r\nLoganmo...
2       92.2        26.0       f                    INDIVIDUAL        0.0       0.0                   ...269\r\nNew Sab...
3       21.5        13.0       f                    INDIVIDUAL        0.0       0.0                   ...Ford\r\nDelacruzs...
4       69.8        43.0       f                    INDIVIDUAL        1.0       0.0                   ...Roads\r\nGreggs...
...
396025  34.3        23.0       w                    INDIVIDUAL        0.0       0.0                   ...Crossing\r\nJoh...
396029  91.3        19.0       f                    INDIVIDUAL        NaN       0.0                   ...Causeway\r\nBria...
In [6]: df.describe()
# Convert to datetime
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%Y')
df.info()
Columns emp_title and title can be dropped as they would not affect the loan approval decision
Null values in revol_util and pub_rec_bankruptcies are few in number, so those rows can be dropped
Let us check the distribution of the remaining features before deciding how to handle their null values
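The cleanup code itself is not shown in this extract; below is a sketch consistent with the zero null counts and the new zip_code column in In [12]. The mort_acc and emp_length handling and the zip extraction are assumptions:
df = df.drop(columns=['emp_title', 'title', 'sub_grade'])  # sub_grade is also absent from In [12]
df = df.dropna(subset=['revol_util', 'pub_rec_bankruptcies', 'emp_length'])  # few nulls: drop rows
df['mort_acc'] = df['mort_acc'].fillna(df['mort_acc'].median())  # assumed imputation strategy
df['zip_code'] = df['address'].str[-5:].astype('category')  # assumed: zip is the address suffix
df = df.drop(columns=['address'])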
In [12]: df.isna().sum()
Out[12]:
loan_amnt 0
term 0
int_rate 0
installment 0
grade 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
purpose 0
dti 0
earliest_cr_line 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 0
total_acc 0
initial_list_status 0
application_type 0
mort_acc 0
pub_rec_bankruptcies 0
zip_code 0
dtype: int64
In [17]: num_cols = 6
num_rows = int(np.ceil(len(numerical_columns)/num_cols))
fig, axs = plt.subplots(num_rows, num_cols, figsize=(10, 15))
# Draw one boxplot per numerical column to inspect spread and outliers
for idx in range(len(numerical_columns)):
    ax = plt.subplot(num_rows, num_cols, idx+1)
    sns.boxplot(ax=ax, data=df, y=numerical_columns[idx])
plt.tight_layout()
plt.show()
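The helper detectOutliers_std used in the next cell is defined earlier in the notebook and not shown in this extract; a minimal sketch of such a helper under the usual 3-standard-deviation rule:
def detectOutliers_std(series, n_std=3):
    # Values more than n_std standard deviations from the mean are outliers
    mean, std = series.mean(), series.std()
    lower_outliers = series[series < mean - n_std * std]
    higher_outliers = series[series > mean + n_std * std]
    return lower_outliers, higher_outliers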
In [18]: numerical_columns = df.select_dtypes(include=np.number).columns
numerical_columns = list(numerical_columns)
numerical_columns.remove('pub_rec')
numerical_columns.remove('pub_rec_bankruptcies')
numerical_columns = pd.Index(numerical_columns)
# Collect each column's outliers as detected by the standard-deviation method
column_outlier_dictionary = {}
for column in numerical_columns:
    lower_outliers, higher_outliers = detectOutliers_std(df[column])
    column_outlier_dictionary[column] = [lower_outliers, higher_outliers]
Based on the boxplots and the outlier counts from both the IQR and standard-deviation methods, I will
remove outliers using the standard-deviation method, except for the columns pub_rec and
pub_rec_bankruptcies, which will be handled based on a manual check of their value counts.
In [21]: df['pub_rec'].value_counts()
Out[21]:
pub_rec
0.0 315552
1.0 47129
2.0 5107
3.0 1424
4.0 481
5.0 218
6.0 108
7.0 47
8.0 31
10.0 11
9.0 10
11.0 6
13.0 4
12.0 4
19.0 2
40.0 1
17.0 1
86.0 1
24.0 1
15.0 1
Name: count, dtype: int64
In [22]: df['pub_rec_bankruptcies'].value_counts()
Out[22]:
pub_rec_bankruptcies
0.0 327200
1.0 40774
2.0 1716
3.0 332
4.0 75
5.0 30
6.0 6
7.0 4
8.0 2
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Index: 370106 entries, 0 to 396029
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 370106 non-null float64
1 term 370106 non-null category
2 int_rate 370106 non-null float64
3 installment 370106 non-null float64
4 grade 370106 non-null category
5 emp_length 370106 non-null category
6 home_ownership 370106 non-null category
7 annual_inc 370106 non-null float64
8 verification_status 370106 non-null category
9 issue_d 370106 non-null datetime64[ns]
10 loan_status 370106 non-null category
11 purpose 370106 non-null category
12 dti 370106 non-null float64
13 earliest_cr_line 370106 non-null datetime64[ns]
14 open_acc 370106 non-null float64
15 pub_rec 370106 non-null float64
16 revol_bal 370106 non-null float64
17 revol_util 370106 non-null float64
18 total_acc 370106 non-null float64
19 initial_list_status 370106 non-null category
20 application_type 370106 non-null category
21 mort_acc 370106 non-null float64
22 pub_rec_bankruptcies 370106 non-null float64
23 zip_code 370106 non-null category
dtypes: category(10), datetime64[ns](2), float64(12)
memory usage: 45.9 MB
Insight
The number of rows reduced to 370106 from the original 396030 after outlier removal
Insight
The median loan amount is slightly higher for loans that were charged off
# Binarize mort_acc: 0 if no mortgage accounts, 1 if any
group_0_list = [0.0]
mort_acc_list = list(df['mort_acc'].unique())
group_1_list = list(set(mort_acc_list) - set(group_0_list))
df['any_mort'] = df['mort_acc'].replace(group_0_list, 0)
df['any_mort'] = df['any_mort'].replace(group_1_list, 1)
df['any_mort'] = df['any_mort'].astype('category')
# Binarize pub_rec_bankruptcies the same way
group_0_list = [0.0]
pub_rec_list = list(df['pub_rec_bankruptcies'].unique())
group_1_list = list(set(pub_rec_list) - set(group_0_list))
df['any_bankruptcies'] = df['pub_rec_bankruptcies'].replace(group_0_list, 0)
df['any_bankruptcies'] = df['any_bankruptcies'].replace(group_1_list, 1)
df['any_bankruptcies'] = df['any_bankruptcies'].astype('category')
Insight
Loan amount is highly correlated with installment
There is good correlation between loan amount and annual income, loan amount and revolving balance,
installment and annual income, installment and revolving balance, and open accounts and total accounts
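These correlations can be inspected with a heatmap; a minimal sketch using the numerical columns defined above:
plt.figure(figsize=(10, 8))
sns.heatmap(df[numerical_columns].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between numerical features')
plt.tight_layout()
plt.show()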
5. Data Preprocessing
In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 370106 entries, 0 to 396029
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 370106 non-null float64
1 term 370106 non-null category
2 int_rate 370106 non-null float64
3 installment 370106 non-null float64
4 grade 370106 non-null category
5 home_ownership 370106 non-null category
6 annual_inc 370106 non-null float64
7 verification_status 370106 non-null category
8 issue_d 370106 non-null datetime64[ns]
9 loan_status 370106 non-null category
10 purpose 370106 non-null category
11 dti 370106 non-null float64
12 earliest_cr_line 370106 non-null datetime64[ns]
13 open_acc 370106 non-null float64
14 revol_bal 370106 non-null float64
15 revol_util 370106 non-null float64
16 total_acc 370106 non-null float64
17 application_type 370106 non-null category
18 zip_code 370106 non-null category
19 any_mort 370106 non-null category
dtypes: category(9), datetime64[ns](2), float64(9)
memory usage: 37.1 MB
The date features are not expected to affect the loan status, so I will drop the issue_d and
earliest_cr_line columns
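The VIF table below was presumably produced along these lines, using the statsmodels imports from Section 2 (the frame name X_num is a stand-in):
X_num = sm.add_constant(df.select_dtypes(include=np.number))
vif = pd.DataFrame({
    'feature': X_num.columns,
    'VIF': [variance_inflation_factor(X_num.values, i) for i in range(X_num.shape[1])]
})
print(vif.round(2).sort_values('VIF', ascending=False))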
   feature      VIF
0  const        24.91
1  loan_amnt    11.58
3  installment  11.04
6  open_acc      2.00
9  total_acc     1.87
7  revol_bal     1.76
4  annual_inc    1.62
8  revol_util    1.47
5  dti           1.40
2  int_rate      1.23
Insight
Loan amount is highly correlated with installment, which is confirmed here by the high VIF values of
both. I will drop installment and recompute the VIF scores
   feature     VIF
0  const       24.76
5  open_acc     2.00
8  total_acc    1.86
6  revol_bal    1.75
3  annual_inc   1.62
1  loan_amnt    1.49
7  revol_util   1.46
4  dti          1.40
2  int_rate     1.22
Insight
Based on the above VIF scores, I can conclude that there are no more multicollinear numerical features
I will drop installment from the dataframe
In [42]: X = final_df.drop(columns=['loan_status'])
y = final_df['loan_status']
In [45]: X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 370106 entries, 0 to 370105
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_amnt 370106 non-null float64
1 term 370106 non-null int8
2 int_rate 370106 non-null float64
3 grade 370106 non-null category
4 home_ownership 370106 non-null category
5 annual_inc 370106 non-null float64
6 verification_status 370106 non-null category
7 purpose 370106 non-null category
8 dti 370106 non-null float64
9 open_acc 370106 non-null float64
10 revol_bal 370106 non-null float64
11 revol_util 370106 non-null float64
12 total_acc 370106 non-null float64
13 application_type 370106 non-null category
14 zip_code 370106 non-null category
15 any_mort 370106 non-null int8
dtypes: category(6), float64(8), int8(2)
memory usage: 25.4 MB
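The jump from 16 columns in In [45] to the 53 columns seen next suggests one-hot encoding of the remaining categorical features before the train-test split; a sketch (drop_first, test_size and stratification are assumptions):
X = pd.get_dummies(X, columns=['grade', 'home_ownership', 'verification_status',
                               'purpose', 'application_type', 'zip_code'],
                   drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0, stratify=y)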
Out[47]: loan_amnt term int_rate annual_inc dti open_acc revol_bal revol_util total_acc any_mort ... zip_code
5 rows × 53 columns
In [49]: X_train.head()
Out[49]: loan_amnt term int_rate annual_inc dti open_acc revol_bal revol_util total_acc any_mort ... zip
133405 27000.0 1 16.29 82302.0 25.52 13.0 12014.0 48.6 29.0 1 ...
365868 6000.0 0 18.55 45000.0 19.37 8.0 3219.0 73.2 11.0 0 ...
71124 8975.0 0 9.71 65000.0 7.98 10.0 3932.0 34.5 58.0 0 ...
33923 9600.0 0 6.62 58000.0 25.01 10.0 57236.0 36.3 19.0 1 ...
30512 18000.0 0 11.53 75000.0 8.50 9.0 9916.0 35.8 17.0 0 ...
5 rows × 53 columns
In [51]: X_train.head()
Out[51]: loan_amnt term int_rate annual_inc dti open_acc revol_bal revol_util total_acc any_mort ... zip_c
0 0.697828 1.0 0.506230 0.308276 0.372229 0.48 0.154818 0.394481 0.457627 1.0 ...
1 0.144832 0.0 0.610521 0.161417 0.282526 0.28 0.041481 0.594156 0.152542 0.0 ...
2 0.223173 0.0 0.202584 0.240157 0.116394 0.36 0.050669 0.280032 0.949153 0.0 ...
3 0.239631 0.0 0.059991 0.212598 0.364790 0.36 0.737568 0.294643 0.288136 1.0 ...
4 0.460829 0.0 0.286571 0.279528 0.123979 0.32 0.127782 0.290584 0.254237 0.0 ...
5 rows × 53 columns
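All values in Out[51] lie in [0, 1], which suggests min-max scaling was applied between In [49] and In [51]; a sketch (the scaler choice is an assumption):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on the training split only to avoid data leakage into the test split
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)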
In [52]: y_train.value_counts(normalize=True)*100
Out[52]:
loan_status
0 80.249186
1 19.750814
Name: proportion, dtype: float64
We can see a clear imbalance in the target class, with class 1 at ~20% and class 0 at ~80%. Hence, I will
use SMOTE to correct this imbalance
In [53]: smote = SMOTE(random_state=0)  # use a name that does not shadow the statsmodels alias sm
X_train, y_train = smote.fit_resample(X_train, y_train)
y_train.value_counts(normalize=True)*100
Out[53]:
loan_status
0 50.0
1 50.0
Name: proportion, dtype: float64
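The model-fitting cell is not included in this extract; a minimal sketch assuming scikit-learn's LogisticRegression, along with the feature_imp frame used by the plot below:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Pair each feature with its learned coefficient for the importance plot
feature_imp = pd.DataFrame({'Columns': X_train.columns,
                            'Coefficients': model.coef_[0]})
feature_imp = feature_imp.sort_values('Coefficients', ascending=False)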
plt.figure(figsize=(8,8))
sns.barplot(data=feature_imp, y = 'Columns', x = 'Coefficients')
plt.title("Feature Importance for Model")
plt.yticks(fontsize=8)
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
Insight
The features zip_code_29597, zip_code_05113, zip_code_00813, annual_inc and
application_type_joint have high positive weights, while zip_code_86630,
zip_code_11650, zip_code_93700, dti and open_acc have high negative weights, indicating
their major contribution towards the target variable
# Compute the false positive rate, true positive rate, and thresholds
fpr, tpr, thresholds = roc_curve(y_test, probs)
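The plotting cell is omitted here; a sketch of how probs would have been defined and of how the curve and the AU-ROC value quoted below can be produced:
probs = model.predict_proba(X_test)[:, 1]  # assumed definition of probs used above
roc_auc = auc(fpr, tpr)                    # area under the ROC curve

plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random model')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()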
Insight
The ROC curve illustrates the trade-off between TPR (True Positive Rate) and FPR (False Positive Rate)
for various thresholds
The AU-ROC value of 0.91 signifies that the model is able to differentiate well between the two classes
Let us also look at the PR curve (Precision-Recall curve)
Insight
The PR curve illustrates the trade-off between Precision and Recall for various thresholds
The model has an AU-PRC value of 0.78, which is not that high, though it is better than a random model,
which has an AU-PRC value of 0.5
This clearly shows that we cannot judge the model's performance from the ROC curve alone
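The precision, recall and thr arrays used in In [63] below presumably come from scikit-learn's precision_recall_curve:
precision, recall, thr = precision_recall_curve(y_test, probs)
pr_auc = auc(recall, precision)  # AU-PRC; precision/recall have one more element than thr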
In [63]: plt.figure()
plt.plot(thr, precision[0:len(thr)], label='precision', color='blue')
plt.plot(thr, recall[0:len(thr)], label='recall', color='orange')
# Threshold where precision and recall meet (smallest absolute gap, which is
# robust even when the two arrays never match exactly)
intersection_idx = np.argmin(np.abs(precision[0:len(thr)] - recall[0:len(thr)]))
intersection_thr = thr[intersection_idx].round(4)
plt.axvline(intersection_thr, linestyle='--', color='red')
plt.text(intersection_thr, 0.01, str(intersection_thr), ha='left', color='red')
plt.title("Precision-recall curve at different thresholds")
plt.xlabel("Threshold values")
plt.ylabel("Precision and Recall values")
plt.legend(loc="upper right")
plt.grid()
plt.show()
In [64]: y_pred = model.predict_proba(X_test)[:, 1]
# Classify with the tuned threshold instead of the default 0.5
threshold_considered = intersection_thr
y_pred_custom = (y_pred > threshold_considered).astype('int')
print(classification_report(y_test, y_pred_custom))
7. Insights
80% of the customers have fully paid their loan and 20% have defaulted
Loan amount and installment are highly correlated, which is expected since a higher loan amount implies
a higher installment amount
Loans taken for the short term, i.e. 3 years (36 months), are most likely to be fully paid back
Most of the borrowers have a home ownership status of mortgage
Surprisingly, loans that are not verified are more likely to be paid back
Loans taken under the joint application type are more likely to be paid back
Borrowers with grade A are more likely to fully pay their loan
Loans taken for weddings are more likely to be paid back
Borrowers from zip codes 00813 and 05113 fully pay back their loans, whereas borrowers from zip codes
11650, 86630 and 93700 are all defaulters
The features zip_code_29597, zip_code_05113, zip_code_00813, annual_inc, application_type_joint,
zip_code_86630, zip_code_11650, zip_code_93700, dti and open_acc affected the model outcome most
heavily
As per the ROC curve and AU-ROC value of 0.91, the model is able to differentiate well between
defaulters and non-defaulters
As per the PR curve and AU-PRC value of 0.78, the model returns reasonably accurate positive
predictions while retrieving the majority of all positives (high recall), and clearly beats a random model
8. Recommendation
The bank can provide more short-term loans, i.e. for 3 years, without much risk
Provide more joint loans, and scrutinize individual and direct-pay application types more closely
Carefully analyze the loan applications of customers with grades D, E, F and G. Either do not provide
them loans or provide smaller loans to these customers
Reduce the loans given for small businesses, or analyze their applications in detail before lending to
small businesses
Do not provide loans to customers from zip codes 11650, 86630 and 93700